Methods and machine learning systems for predicting the likelihood or risk of having cancer

ABSTRACT

Embodiments of the present invention relate generally to non-invasive methods and tests that measure biomarkers (e.g., tumor antigens) and collect clinical parameters from patients, and computer-implemented machine learning methods, apparatuses, systems, and computer-readable media for assessing a likelihood that a patient has a disease, relative to a patient population or a cohort population. In one embodiment, a classifier is generated using a machine learning system based on training data from retrospective data and a subset of inputs (e.g., at least two biomarkers and at least one clinical parameter), wherein each input has an associated weight and the classifier meets a predetermined Receiver Operator Characteristic (ROC) statistic, specifying a sensitivity and a specificity, for correct classification of patients. The classifier may then be used to assess the likelihood that a patient has cancer relative to a population by classifying the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation-in-part of PCT/US15/64344 filed 7 Dec. 2015, which claims the benefit of U.S. Provisional Patent Application No. 62/089,061, filed 8 Dec. 2014, the contents of which are each incorporated herein by reference in their entirety.

TECHNICAL FIELD

Present invention embodiments relate generally to using an artificial intelligence/machine learning system for analyzing data and making predictions based upon the data, and more specifically, to predicting the likelihood or risk for having a disease such as cancer, especially in an otherwise asymptomatic or vaguely symptomatic patient.

BACKGROUND

Early Detection of Cancer

For many types of cancers, patient outcomes improve significantly if surgery and other therapeutic interventions commence before the tumor has metastasized. Accordingly, imaging and diagnostic tests have been introduced into medical practice in an attempt to help physicians detect cancer early. These include various imaging modalities such as mammography as well as diagnostic tests to identify cancer specific “biomarkers” in the blood and other bodily fluids such as the prostate specific antigen (PSA) test. The value of many of these tests is often questioned, particularly with regard to whether the costs and risks associated with false positives, false negatives, etc. outweigh the potential benefits in terms of actual lives saved. Furthermore, in order to demonstrate this value, data from large numbers of patients (many thousands or even tens of thousands) must be generated in real world (prospective) studies rather than laboratory stored (retrospective) studies. Unfortunately, the costs of conducting large prospective studies for screening tools outweigh the reasonably anticipated financial returns, so these large prospective studies are almost never done by the private sector and are only occasionally sponsored by governments. As a result, the use paradigms for blood testing for the early detection of most cancers have progressed little in several decades. In the United States, for example, PSA remains the only widely utilized blood test for cancer screening and even its utilization has become controversial. In other parts of the world, especially the Far East, blood tests for detecting various cancers are more commonplace but there is little standardization or empirical methods to ascertain or improve the accuracy of such testing in those parts of the world.

It would therefore be desirable to improve the accuracy and standardization of cancer screening in those regions where it is common and, in so doing, generate tools and technologies that may improve and/or encourage cancer screening in those regions where it is less common.

Cancer detection poses significant technical challenges as compared to detecting viral or bacterial infections since cancer cells, unlike viruses and bacteria, are biologically similar to and hard to distinguish from normal, healthy cells. For this reason, tests used for the early detection of cancer often suffer from higher numbers of false positives and false negatives than comparable tests for viral or bacterial infections or for tests that measure genetic, enzymatic, or hormonal abnormalities. This often causes confusion among healthcare practitioners and their patients, leading in some cases to unnecessary, expensive, and invasive follow-up testing while in other cases to a complete disregard for follow-up testing, resulting in cancers being detected too late for useful intervention. Physicians and patients welcome tests that yield a binary decision or result, e.g., either the patient is positive or negative for a condition, such as observed in the over-the-counter pregnancy test kits which present, for example, an immunoassay result in the shape of a plus sign or a negative sign as an indication of pregnancy or not. However, unless the sensitivity and specificity of diagnosis approach 99%, a level not obtainable for most cancer tests, such binary outputs can be highly misleading or inaccurate.

It would therefore be desirable to provide healthcare practitioners and their patients with more quantitative information about their likelihood of having a particular cancer, even if a binary output is not practical.

Detecting early stage cancer is also challenging due to factors associated with the modern day practice of medicine. Primary care providers in particular typically see a high volume of patients per day and the demands of healthcare cost containment have dramatically shortened the amount of time they can spend with each patient. Accordingly, physicians often lack sufficient time to take in-depth family and lifestyle histories, to counsel patients on healthy lifestyles, or to follow up with patients who have been recommended testing beyond that which is provided in their office practice.

It would therefore be desirable to provide high-volume primary care providers, in particular, with useful tools to help them triage or compare the relative risks for their patients of having cancer so they can order additional testing for those patients at the highest risks.

Lung Cancer and Early Detection

Lung cancer is by far the leading cause of cancer deaths in North America and in most of the world, killing more people than the next three most lethal cancers combined, namely breast, prostate, and colorectal cancer. Lung cancer results in over 156,000 deaths per year in the United States alone (American Cancer Society. Cancer Facts & Figures 2011. Atlanta: American Cancer Society; 2011). Tobacco use has been identified as a primary causal factor for lung cancer and is thought to account for some 90% of cases. Thus, individuals over 50 years of age with a smoking history of greater than 20 pack-years have a 1 in 7 lifetime risk of developing the disease. Lung cancer is a relatively silent disease displaying few if any specific symptoms until it reaches the later more advanced stages. Therefore, most patients are not diagnosed until after their cancer has metastasized beyond the lung and the cancer is no longer treatable by surgery alone. Thus, while the best way to prevent lung cancer is likely tobacco avoidance or cessation, for many current and former smokers, the transforming, cancer-causing event has already occurred and even though the cancer is not yet manifest, the damage has already been done. Thus, perhaps the most effective means of reducing lung cancer mortality is early stage detection when the tumor is still localized and amenable to surgery with intent to cure.

The importance of early detection was recently demonstrated in a large 7-year clinical study, the National Lung Screening Trial (NLST), which compared chest x-ray and chest computed tomography (CT) scanning as potential modalities for the early detection of lung cancer (National Lung Screening Trial Research Team, Aberle D. R., Adams A. M., Berg C. D., Black W. C., Clapp J. D., Fagerstrom R. M., Gareen I. F., Gatsonis C., Marcus P. M., Sicks J. D. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 2011 Aug. 4; 365(5):395-409). The trial concluded that the use of chest CT scans to screen the at-risk population identified significantly more early stage lung cancers than chest x-rays and resulted in a 20% overall reduction in disease mortality. This study has clearly indicated that identifying lung cancer early can save lives. Unfortunately, the broad application of CT scanning as a screening method for lung cancer is problematic. The NLST design utilized a serial CT screening paradigm in which patients received a CT scan annually for only three years. Nearly 40% of the participants receiving the annual CT scan over 3 years had at least one positive screening result and 96.4% of these positive screening results were false positives. This very high rate of false positives can cause patient anxiety and place a burden on the healthcare system, as the work-up following a positive finding on low-dose CT scans often includes advanced imaging and biopsies. Although CT scanning is an important tool for the early detection of lung cancer, more than two years after the NLST results were announced, very few patients at high risk for lung cancer due to smoking history have initiated a program of annual CT scans. This reluctance to undergo yearly CT scans is likely due to a number of factors including costs, perceived risks of radiation exposure, especially by serial CT scans, the inconvenience or burden to asymptomatic patients of scheduling a separate diagnostics procedure at a radiology center, as well as concerns by physicians that the very high false positive rates of CT scanning as a standalone test will result in a significant number of unnecessary follow-up diagnostic tests and invasive procedures.

While the overall lifetime risk for lung cancer amongst smokers is high, the chance that any individual smoker has cancer at a specific point in time is on the order of 1.5-2.7% [Bach, P. B., et al., Screening for Lung Cancer: ACCP Evidence-Based Clinical Practice Guidelines (2nd Edition). CHEST Journal, 2007. 132(3_suppl): p. 69S-77S.]. Due to this low disease prevalence, identifying which patients are at highest risk is challenging and complex.
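The effect of low prevalence on a binary test result can be illustrated with Bayes' rule. The following is a minimal sketch only: the function name and the 80% sensitivity and specificity figures are assumptions chosen for illustration, and the 2% prevalence is simply a value inside the 1.5-2.7% range cited above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a patient with a positive result actually has the disease."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

# Hypothetical test: 80% sensitivity, 80% specificity, 2% point prevalence.
ppv_moderate = positive_predictive_value(0.80, 0.80, 0.02)

# Same prevalence, but sensitivity and specificity both at 99%.
ppv_high = positive_predictive_value(0.99, 0.99, 0.02)

print(f"PPV at 80%/80%: {ppv_moderate:.1%}")
print(f"PPV at 99%/99%: {ppv_high:.1%}")
```

Under these assumed figures most positive results from the 80%/80% test would be false positives, which is one way to see why a bare positive or negative output can mislead at this prevalence.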

It would be desirable to have blood tests to complement use of radiographic screening for the early detection of lung cancer.

Artificial Intelligence/Machine Learning Systems

Artificial intelligence/machine learning systems are useful for analyzing information, and may assist human experts in decision making. For example, machine learning systems comprising diagnostic decision-support systems may use clinical decision formulas, rules, trees, or other processes for assisting a physician with making a diagnosis.

Although decision-making systems have been developed, such systems are not widely used in medical practice because these systems suffer from limitations that prevent them from being integrated into the day-to-day operations of health organizations. For example, decision-making systems may provide an unmanageable volume of data, rely on analysis that is marginally significant, and not correlate well with complex multimorbidity (Greenhalgh, T. Evidence based medicine: a movement in crisis? BMJ (2014) 348:g3725).

Many different healthcare workers may see a patient, and patient data may be scattered across different computer systems in both structured and unstructured form. Also, the systems are difficult to interact with (Berner, 2006; Shortliffe, 2006). The entry of patient data is difficult, the list of diagnostic suggestions may be too long, and the reasoning behind diagnostic suggestions is not always transparent. Further, the systems are not focused enough on next actions, and do not help the clinician figure out what to do to help the patient (Shortliffe, 2006).

It would, therefore, be desirable to provide methods and technologies to permit artificial intelligence/machine learning systems to be used to aid in the early detection of cancer, especially with blood testing.

SUMMARY

Embodiments of the present invention relate generally to non-invasive methods, diagnostic tests, especially blood (including serum or plasma) tests that measure biomarkers (e.g. tumor antigens), and computer-implemented machine learning methods, apparatuses, systems, and computer-readable media for assessing a likelihood that a patient has a disease, such as cancer, relative to a patient population or a cohort population to determine whether that patient should be followed up with additional, more invasive testing.

In embodiments, there is provided a computer implemented method for predicting a likelihood of having cancer in a patient, in a computer system having one or more processors coupled to a memory storing one or more computer readable instructions for execution by the one or more processors, the one or more computer readable instructions comprising instructions for: storing a set of data comprising a plurality of patient records, each patient record including a plurality of parameters and corresponding values for a patient, and wherein the set of data also includes a diagnostic indicator indicating whether or not the patient has been diagnosed with cancer. In embodiments, the patient records are retrospective data which includes both a diagnosis and patient data such as measured biomarkers and clinical parameters. The computer implemented method comprises selecting a subset of the plurality of parameters for inputs into a machine learning system, wherein the subset includes a panel of at least two different biomarkers and at least one clinical parameter; randomly partitioning the set of data into training data and validation data; generating a classifier using a machine learning system based on the training data and the subset of inputs, wherein each input has an associated weight; and determining whether the classifier meets a predetermined Receiver Operator Characteristic (ROC) statistic, specifying a sensitivity and a specificity, for correct classification of patients.
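The storing, selecting, and partitioning steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiments: the record layout, the field name `diagnosed_with_cancer`, the 30% validation fraction, and both helper functions are hypothetical.

```python
import random

def select_inputs(record, input_names):
    """Return the chosen parameter values of one patient record, in order."""
    return [record[name] for name in input_names]

def partition_records(records, validation_fraction=0.3, seed=0):
    """Randomly split records into (training, validation) lists."""
    rng = random.Random(seed)  # seeded so that the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical retrospective records: parameter values plus a diagnostic indicator.
records = [
    {"id": i, "CEA": 2.0 + i, "CYFRA 21-1": 1.0 + 0.5 * i, "age": 50 + i,
     "diagnosed_with_cancer": i % 2 == 0}
    for i in range(10)
]

# Subset of inputs: two biomarkers and one clinical parameter.
inputs = ["CEA", "CYFRA 21-1", "age"]
training, validation = partition_records(records)
training_rows = [select_inputs(r, inputs) for r in training]
```

The classifier itself would then be fitted on `training_rows` and the matching diagnostic indicators, with the validation records held back for the ROC check.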

In embodiments, the predetermined ROC statistic is a sensitivity of at least 70% with at least an 80% specificity. In certain embodiments, the sensitivity, with an 80% specificity, is at least 75%, 80%, 82%, 85%, 87%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97% or 98%. In other embodiments, the sensitivity, with an 85% specificity, is at least 70%, 75%, 80%, 82%, 85%, 87%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97% or 98%. In embodiments, the sensitivity, with a 90% specificity, is at least 70%, 75%, 80%, 82%, 85%, 87%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97% or 98%.
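One way to check a classifier against a statistic of this form is to scan candidate score thresholds and report the best sensitivity among those that satisfy the specificity floor. The sketch below assumes the classifier emits a numeric score where higher means more suspicious; the function names and the sample scores are hypothetical.

```python
def sensitivity_at_specificity(scores, labels, min_specificity):
    """Best sensitivity over thresholds whose specificity is >= min_specificity.

    scores: classifier outputs, higher = more suspicious (an assumption).
    labels: 1 for confirmed cancer, 0 for no cancer.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    best = 0.0
    for threshold in sorted(set(scores)):
        # A patient is called positive when the score is >= threshold.
        specificity = sum(s < threshold for s in controls) / len(controls)
        if specificity >= min_specificity:
            sensitivity = sum(s >= threshold for s in cases) / len(cases)
            best = max(best, sensitivity)
    return best

def meets_roc_statistic(scores, labels, min_sensitivity=0.70, min_specificity=0.80):
    """True when the sensitivity floor is reached at the specificity floor."""
    return sensitivity_at_specificity(scores, labels, min_specificity) >= min_sensitivity

# Hypothetical validation scores: four cases followed by five controls.
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
acceptable = meets_roc_statistic(scores, labels)
```

The default floors of 70% sensitivity and 80% specificity mirror the first embodiment above; the other listed combinations can be checked by passing different floors.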

In embodiments, the computer implemented method further comprises iteratively regenerating the classifier when the classifier does not meet the predetermined ROC statistic, by using a different subset of inputs and/or by adjusting the associated weights of the inputs until the regenerated classifier meets the predetermined ROC statistic. In certain embodiments, the computer implemented method further comprises generating a static configuration of the classifier when the machine learning system meets the predetermined ROC statistic. The classifier may be used, for example by a physician, when the classifier is static, semi-static (e.g. the classifier may be updated at designated intervals) or dynamic (e.g. the classifier is updated as additional data for a patient, including a diagnosis, is inputted into the system). Typically, a diagnosis for the presence of cancer is confirmed with radiographic screening and/or by histology of a biopsy sample.

In embodiments, the method comprises classifying the validation data using the classifier; determining whether the classifier meets the predetermined ROC statistic; and when the classifier does not meet the predetermined ROC statistic, iteratively regenerating the classifier by using a different subset of inputs and/or by adjusting the associated weights of the inputs, until the regenerated classifier meets the predetermined ROC statistic. In embodiments, the method further comprises configuring a computing device accessible by a user with the static classifier; entering values for the subset of the plurality of parameters corresponding to a patient into the computing device; and classifying, using the static classifier, the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer.
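The regenerate-until-acceptable loop can be sketched as a search over input subsets. The exhaustive enumeration below is an assumption made for illustration (the text only requires that a different subset and/or different weights be tried), and `train` and `meets_target` are hypothetical callables standing in for the machine learning system and the ROC check on validation data.

```python
from itertools import combinations

def regenerate_until_acceptable(biomarkers, clinical_parameters, train, meets_target):
    """Try input subsets until a trained classifier passes the target check.

    Each candidate subset holds at least two biomarkers and at least one
    clinical parameter. Returns (inputs, classifier), or None if none passes.
    """
    for n_bio in range(2, len(biomarkers) + 1):
        for bio in combinations(biomarkers, n_bio):
            for n_clin in range(1, len(clinical_parameters) + 1):
                for clin in combinations(clinical_parameters, n_clin):
                    inputs = bio + clin
                    classifier = train(inputs)      # regenerate the classifier
                    if meets_target(classifier):    # e.g. ROC statistic check
                        return inputs, classifier
    return None

# Stand-ins for demonstration only: the "classifier" is just its input set,
# and the target is met once three particular inputs are all present.
def demo_train(inputs):
    return frozenset(inputs)

def demo_meets_target(classifier):
    return {"CYFRA 21-1", "age", "pack_years"} <= classifier

result = regenerate_until_acceptable(
    ["CEA", "CA125", "CYFRA 21-1"], ["age", "pack_years"],
    demo_train, demo_meets_target,
)
```

In practice a search of this kind would usually be bounded, since the number of subsets grows quickly with the size of the panel.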

In embodiments, the category indicative of a likelihood of having cancer is further categorized into qualitative groups such as low, medium, high, or some combination or sub-combination thereof. In alternative embodiments, the category indicative of a likelihood of having cancer is further categorized into quantitative groups. Those quantitative groups may be provided to the user as a percentage, multiplier value, composite score or risk score for the likelihood of having cancer or an increased risk of having cancer. In certain embodiments, the methods further comprise providing a notification to the user recommending diagnostic testing when the patient is classified into the category indicative of a likelihood of having cancer. In embodiments, the diagnostic testing is radiographic screening or analysis of a biopsy sample.
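A minimal sketch of this two-level output follows, assuming the classifier yields a probability between 0 and 1. The decision threshold, the group cutoffs, and the dictionary keys are all hypothetical values chosen for illustration; the text does not specify them.

```python
def classify_patient(probability, decision_threshold=0.5, group_cutoffs=(0.65, 0.85)):
    """Map a classifier probability to a category, a qualitative group,
    a percentage, and a flag for recommending diagnostic testing."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    percentage = round(probability * 100, 1)
    if probability < decision_threshold:
        return {"category": "cancer not likely", "group": None,
                "percentage": percentage, "notify": False}
    low_upper, medium_upper = group_cutoffs
    if probability < low_upper:
        group = "low"
    elif probability < medium_upper:
        group = "medium"
    else:
        group = "high"
    # A patient in the "likely" category triggers the notification.
    return {"category": "cancer likely", "group": group,
            "percentage": percentage, "notify": True}

report = classify_patient(0.7)
```

Here the qualitative groups subdivide only the category indicating a likelihood of cancer, following the wording above.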

In embodiments wherein the classifier is updated, the method further comprises obtaining test results from the diagnostic testing which confirm or deny the presence of cancer; incorporating the test results into the training data for further training of the machine learning system; and generating an improved classifier by the machine learning system.

In embodiments, the biomarkers may be any two, any three, any four, any five, or any six or more biomarkers associated with the presence of cancer. In embodiments, the panel of biomarkers is selected from the group consisting of: AFP, CA125, CA 15-3, CA 19-9, CEA, CYFRA 21-1, HE-4, NSE, Pro-GRP, PSA, SCC, anti-Cyclin E2, anti-MAPKAPK3, anti-NY-ESO-1, and anti-p53. In embodiments, a sample is obtained from a patient for measurement of biomarkers wherein the sample is blood, blood serum, blood plasma, or a component thereof. In embodiments, the clinical parameters may be one or more of age; gender; smoking status (e.g. lung cancer); number of pack years; symptoms; family history of cancer; concomitant illnesses; number of nodules (e.g. pulmonary nodules); size of nodules; and imaging data. See Example 4 for a ranking of biomarkers and clinical factors for lung cancer. In embodiments, clinical parameters for lung cancer include smoking status, pack years, and age. In certain embodiments, clinical parameters for lung cancer include an age of at least 50; and at least a 20 pack year smoking history.

In embodiments, the classifier is a support vector machine, a decision tree, a random forest, a neural network, or a deep learning neural network. In certain embodiments, the classifier is a neural net that has any one or more of the following features: at least two hidden layers; at least two outputs, with a first output indicating that lung cancer is likely and a second output indicating that lung cancer is not likely; and 20-30 nodes. See Example 3 for training of a neural net with retrospective patient data with lung cancer.
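The shape of such a neural net can be sketched with a plain forward pass. The sketch below is untrained and purely illustrative: the weights are random, the tanh and softmax activations are assumptions, and reading "20-30 nodes" as 24 hidden nodes split over two layers of 12 is one possible interpretation rather than the configuration of Example 3.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """One fully connected layer: per node, n_in weights followed by a bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward_layer(layer, inputs, activation):
    """Weighted sum plus bias for every node, passed through the activation."""
    return [activation(sum(w * x for w, x in zip(node[:-1], inputs)) + node[-1])
            for node in layer]

def softmax(values):
    """Convert raw outputs into two probabilities that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def build_network(n_inputs, hidden_sizes=(12, 12), n_outputs=2, seed=0):
    """Two hidden layers and two outputs; weights are random placeholders."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def predict(network, inputs):
    activations = list(inputs)
    for layer in network[:-1]:
        activations = forward_layer(layer, activations, math.tanh)
    raw_outputs = forward_layer(network[-1], activations, lambda z: z)
    likely, not_likely = softmax(raw_outputs)
    return {"cancer_likely": likely, "cancer_not_likely": not_likely}

# Three inputs: two (normalized) biomarker values and one clinical parameter.
network = build_network(n_inputs=3)
outputs = predict(network, [0.2, -0.1, 0.5])
```

Training would adjust the weights of each input and node against the retrospective records; that step is omitted here.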

In embodiments, the cancer is selected from the group consisting of: breast cancer, bile duct cancer, bone cancer, cervical cancer, colon cancer, colorectal cancer, gallbladder cancer, kidney cancer, liver or hepatocellular cancer, lobular carcinoma, lung cancer, melanoma, ovarian cancer, pancreatic cancer, prostate cancer, skin cancer, and testicular cancer. In illustrative embodiments, the cancer is lung cancer.

In embodiments, a computer implemented method for predicting a likelihood of cancer in a subject is provided using a computer system having one or more processors coupled to a memory storing one or more computer readable instructions for execution by the one or more processors, the one or more computer readable instructions comprising instructions for: storing a set of data comprising a plurality of patient records, each patient record including a plurality of parameters for a patient, and wherein the set of data also includes a diagnostic indicator indicating whether or not the patient has been diagnosed with cancer; selecting a plurality of parameters for inputs into a machine learning system, wherein the parameters include a panel of at least two different biomarker values and at least one type of clinical data; and generating a classifier using the machine learning system, wherein the classifier comprises a sensitivity of at least 70% and a specificity of at least 80%, and wherein the classifier is based on a subset of the inputs.

In other embodiments, there is provided use of the classifier in a method of assessing the likelihood that a patient has lung cancer relative to a population, comprising measuring the values of a panel of biomarkers in a sample from a patient and obtaining clinical parameters from the patient; utilizing a classifier generated by a machine learning system to classify the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer, wherein the classifier comprises a sensitivity of at least 70% and a specificity of at least 80%, and wherein the classifier is generated using a panel of biomarkers comprising at least two different biomarkers, and at least one clinical parameter; and when a patient is classified into a category indicating a likelihood of having cancer, providing a notification to a user for diagnostic testing.

In other embodiments, techniques are provided for the use of artificial intelligence/machine learning systems that can incorporate and analyze structured and preferably also unstructured data to perform a risk analysis to determine a likelihood for having cancer, initially lung cancer, but also other types of cancer, including pan-cancer testing (i.e. testing of multiple tumors from a single patient sample). By utilizing algorithms generated from the biomarker levels (e.g. tumor antigens) from large volumes of longitudinal or prospectively collected blood samples (e.g., real world data from one or more regions where blood based tumor biomarker cancer screening is commonplace) together with one or more clinical parameters (e.g. age, smoking history, disease signs or symptoms), a risk level or percentage of that patient having a cancer type is provided. The machine learning system determines a quantifiable risk for the presence of cancer in patients, preferably before they have symptoms or advanced disease, in terms of an increase over the population (e.g., a cohort population). By determining an individual patient's risk relative to the cohort, physicians may recommend further follow-up testing (e.g. radiography) for those patients who are at higher risks relative to the cohort population and also hope to change patient behavior which may be increasing the risk of cancer.

In another embodiment, in addition to the aforementioned biomarker levels and one or more clinical parameters, the biomarker change over time following serial testing (“velocity”) is included in the algorithm.
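As a hedged illustration of a velocity input, the change in a biomarker between the first and last serial measurement can be expressed per year. The function name, the per-year unit, and the use of only the endpoints are assumptions; the text does not define how velocity is computed.

```python
from datetime import date

def biomarker_velocity(measurements):
    """Change in biomarker level per year between the earliest and latest test.

    measurements: list of (date, value) pairs from serial testing.
    """
    if len(measurements) < 2:
        raise ValueError("serial testing requires at least two measurements")
    ordered = sorted(measurements)  # chronological order
    (first_day, first_value), (last_day, last_value) = ordered[0], ordered[-1]
    elapsed_years = (last_day - first_day).days / 365.25
    if elapsed_years <= 0:
        raise ValueError("measurements must span more than one day")
    return (last_value - first_value) / elapsed_years

# Hypothetical serial CEA results, listed out of order on purpose.
velocity = biomarker_velocity([
    (date(2021, 1, 1), 3.0),
    (date(2020, 1, 1), 2.0),
])
```

A fitted slope over all measurements would be an alternative when more than two results are available.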

In yet another embodiment, in addition to the aforementioned biomarker levels and one or more clinical parameters, environmental and/or occupational (workplace) exposure to carcinogens is included in the algorithm.

In yet another embodiment, in addition to the aforementioned biomarker levels and one or more clinical parameters, the patient's personal family history of cancer is included in the algorithm.

In yet another embodiment, in addition to the aforementioned biomarker levels and one or more clinical parameters, published information from the medical and scientific literature is included in the algorithm as unstructured data.

According to embodiments of the present invention, a machine learning system utilizes a plurality of data sources, determines which types of data from the data sources are most predictive for determining a risk of having cancer, and outputs a likelihood (e.g., in the form of a percentage risk score or a multiplier, etc.) of developing cancer relative to a population or a cohort population. Instead of simply making a determination of the risk of cancer based upon a single marker or multiple biomarkers, wherein the concentrations of the biomarker(s) are evaluated with respect to fixed threshold concentration(s), the machine learning system may also optionally consider a plurality of different types of data including electronic medical records (EMRs), publicly available data, biomarkers, biomarker velocities, and other factors associated with the development of cancer to generate the likelihood of having cancer. The risk of the presence of cancer in a given individual may be quantified in terms of an increase over other individuals in the same risk population (e.g., cohort population). Risk relative to a cohort population provides a clear and quantitative way of providing a risk for developing cancer, while avoiding a binary or absolute “yes” or “no” result associated with false positives or negatives. By using more than one neural net in the system to determine which risk factors are the most important (e.g., most predictive), an improved manner of determining which patients are at increased risk of having cancer may be achieved.
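Expressing risk as a multiplier over the cohort can be sketched as a simple ratio. The function name and the sample figures are hypothetical; the 2% cohort figure is only an illustrative value for a population of heavy smokers.

```python
def risk_multiplier(patient_probability, cohort_prevalence):
    """How many times higher the patient's estimated risk is than the cohort's."""
    if not 0.0 < cohort_prevalence <= 1.0:
        raise ValueError("cohort prevalence must lie in (0, 1]")
    if not 0.0 <= patient_probability <= 1.0:
        raise ValueError("patient probability must lie in [0, 1]")
    return patient_probability / cohort_prevalence

# Hypothetical patient with an estimated 8% risk in a cohort with 2% prevalence.
multiplier = risk_multiplier(0.08, 0.02)
print(f"Risk is about {multiplier:.1f}x that of the cohort")
```

A multiplier of this kind conveys relative standing within the cohort without forcing a yes or no answer.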

Other more specific embodiments of the invention may include a blood test for assessing a likelihood that a patient has lung cancer relative to a population or a cohort population of individuals, e.g., individuals of a similar age range and smoking history. In this example, one or more biomarkers are analyzed from the patient's fluid sample, e.g., a blood sample, which is used, at least in part, to determine a biomarker composite score and a risk score as compared to a cohort population, known to have lung cancer as well as non-cancer and other controls. This permits the patient's risk of having lung cancer to be categorized using identifiers such as low, intermediate, high, very high, etc. As sufficient data is generated, the system will calculate a risk percentage, as well as a margin of error. Based on this information, physicians and other healthcare practitioners, patients, and health insurance companies can better determine which patients are most likely to benefit from follow-up testing, including CT screening. Such a method reduces the costs, anxiety, and radiation exposure associated with having lower risk patients undergo CT scans while helping to ensure that patients at higher risk of having lung cancer undergo CT scanning in hopes of detecting the tumor at an early stage when curative surgery is an option.

According to another specific embodiment of the invention, the aforementioned artificial intelligence/machine learning system may be used to enhance or improve a blood test for the simultaneous detection of multiple tumor types from a single blood or serum sample. Such “pan-cancer” tests are common in the Far East such as the test disclosed by Y.-H. Wen, et al. “Cancer screening through a multi-analyte serum biomarker panel during health check-up examinations; Results from a 12-year experience,” Clinica Chimica Acta 450 (2015) 273-276. As another example, six biomarkers, CEA, CYFRA, SCC, CA 15.3, NSE and ProGRP, were identified that were related to the presence of lung cancer [Molina, R. et al. “Assessment of a Combined Panel of Six Serum Tumor Markers for Lung Cancer”, Am. J. Respir. Crit. Care Med. (2015)]. The real world, prospective, raw patient data generated in Taiwan that was used to create that published report could be used, for example, to generate an algorithm according to the present invention that would improve testing both in the region or clinical center where the test was run as well as in regions of the world where such screening paradigms are less common (e.g. the United States).

These and other advantages of the techniques presented herein may be better understood by referring to the following description, accompanying drawings and claims. The embodiments presented herein, set out below to enable one to practice an implementation of the invention, are intended to be non-limiting. Those skilled in the art should readily appreciate that the conceptions and specific embodiments disclosed herein may be used as a basis for modifying or designing other methods and systems for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent assemblies do not depart from the spirit and scope of the invention in its broadest form.

BRIEF DESCRIPTION OF THE FIGURES

The numerous advantages of the present invention may be better understood by those skilled in the art by reference to the accompanying figures in which:

FIGS. 1A-1B are schematic diagrams of an example computing environment in accordance with example embodiments.

FIGS. 2A-2B are illustrations of example neural net systems, in accordance with example embodiments.

FIG. 3 is a flow diagram illustrating operations for identification and correction of problematic data, in accordance with example embodiments.

FIGS. 4A-4B are flow diagrams illustrating operations for determining a risk of having cancer, in accordance with example embodiments.

FIG. 5 is a flow diagram illustrating operations for extraction of data, in accordance with example embodiments.

FIG. 6 is a flow diagram illustrating operations for interfacing with publicly accessible sources of data, in accordance with example embodiments.

FIG. 7 is a schematic diagram illustrating a client and a computing node of an artificial intelligence system in accordance with example embodiments.

FIG. 8 is a schematic diagram illustrating a cloud computing environment for an artificial intelligence system in accordance with example embodiments.

FIG. 9 is a schematic diagram illustrating an abstraction of computing model layers in accordance with example embodiments.

FIG. 10 shows an example of a risk categorization table for a disease such as lung cancer. In this risk categorization table, the inflection point between having a risk greater than the observed risk of smokers of 2% occurs with an aggregate MoM score of above 9. With an aggregate score of 9 or less, that patient has a risk of lung cancer no greater than does any other heavy smoker not yet diagnosed. A MoM score greater than 9 indicates a greater risk of cancer or a higher likelihood of cancer as compared to the smoking population.

FIG. 11 is a flow diagram of example operations for utilizing a machine learning system to construct a cohort population, in accordance with example embodiments.

FIG. 12 is a flow diagram of example operations for utilizing a machine learning system to classify an individual patient, in accordance with example embodiments.

FIG. 13 is an example illustration of a neural net with at least two biomarker inputs and at least one clinical data input, with two levels of hidden layers and two outputs, in accordance with example embodiments.

FIG. 14 is a flow diagram of example operations of generating an artificial neural net to predict a likelihood of having cancer, in accordance with example embodiments.

FIGS. 15A-D show various receiver operator characteristic (ROC) curves using various statistical and machine learning approaches, in accordance with example embodiments.

FIGS. 16A and 16B show the distribution of test scores in a patient cohort conforming to specific test inclusion criteria (older than 50 years, current and former smokers, greater than 20 pack years) using random forest analysis on a panel of markers (age, smoking status, pack years, COPD, CA-125, CEA, CYFRA and anti-NYESO), in accordance with example embodiments.

FIG. 17 shows a ROC curve analysis for discrimination of lung cancer and benign nodules based on a MLR model (3 biomarkers and 3 clinical factors), in accordance with example embodiments.

FIG. 18 shows a histogram of the nodule size in lung cancer cases and controls (benign nodules).

FIG. 19 shows ROC curves for each of the three nodule subgroups based on MLR models, in accordance with example embodiments.

FIG. 20 shows a probability of lung cancer in accordance with example embodiments.

DETAILED DESCRIPTION

Embodiments of the present invention relate generally to non-invasive methods, diagnostic tests, especially blood (including serum or plasma) tests that measure biomarkers (e.g., tumor antigens) in combination with clinical parameters, and computer-implemented machine learning methods, apparatuses, systems, and computer-readable media for assessing a likelihood that a patient has a disease, such as cancer, relative to a patient population or a cohort population to determine whether that patient should be followed up with additional, more invasive testing.

A. Introduction

Embodiments of the present invention provide for non-invasive methods, diagnostic tests, and computer-implemented machine learning methods, apparatuses, systems, and computer-readable media for assessing a likelihood that a patient has a disease, such as cancer, relative to a population or a cohort population by generating, e.g., stratified risk categories to more accurately predict the presence of cancer in an otherwise asymptomatic or vaguely symptomatic patient.

As used herein, “machine learning” refers to algorithms that give a computer the ability to learn without being explicitly programmed, including algorithms that learn from and make predictions about data. Machine learning algorithms include, but are not limited to, decision tree learning, artificial neural networks (ANN) (also referred to herein as a “neural net”), deep learning neural networks, support vector machines, rule-based machine learning, random forest, etc. For the purposes of clarity, algorithms such as linear regression or logistic regression can be used as part of a machine learning process. However, it is understood that using linear regression or another algorithm as part of a machine learning process is distinct from performing a statistical analysis such as regression with a spreadsheet program such as Excel. The machine learning process has the ability to continually learn and adjust the classifier as new data becomes available, and does not rely on explicit or rules-based programming. Statistical modeling, by contrast, relies on finding relationships between variables (e.g., mathematical equations) to predict an outcome.

In the present invention, the machine learning algorithms are “trained” by building a model from inputs. Those inputs may be retrospective data with a known diagnosis of cancer (including matched controls) and data from measured biomarkers and clinical factors of those patients. See Example 3 for training of an ANN using retrospective lung cancer patient data. In that instance the classifier, i.e., the trained machine learning algorithm, can classify new patient data into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer. The category indicative of a likelihood of having cancer can be further divided into qualitative or quantitative sub-groups. Those qualitative groups may include identifiers such as low, medium, intermediate, high, or a combination thereof for a likelihood of having cancer. The quantitative groups may include identifiers such as a percentage, multiplier value, risk score, composite score or any numerical value that can be provided to the user for indicating the likelihood of having cancer. Those quantitative and qualitative groups may also be presented in a table, such as a “risk categorization table” as disclosed herein.

For example, according to one aspect of the present invention, a risk categorization of a population or cohort population of individuals is used to determine a quantified risk level for the presence of a cancer in an asymptomatic human subject. In some aspects, data used to determine the risk level may include, but is not limited to, a blood test that measures multiple biomarkers in the blood (only once or preferably serially to measure changes over time), a patient's medical records and personal history such as smoking, as well as publicly available sources of information pertaining to cancer risk. In certain embodiments, the risk categorization is herein referred to as a risk categorization table. As used herein, the term “table” is used in its broadest sense to refer to a grouping of data into a format providing for ease of interpretation or presentation; this includes, but is not limited to, data provided from execution of computer program instructions or a software application, a table, a spreadsheet, etc. Thus, in one embodiment the risk categorization table is a grouping of a stratified population or cohort population (e.g., a human subject population). This stratification of human subjects is based on analysis of retrospective clinical samples (and may include other data) from subjects diagnosed as having cancer, wherein the actual incidence of cancer, herein referred to as the positive predictive score (PPS), is determined for each stratified grouping. Ideally, the data from the population or cohort is collected on a longitudinal or prospective basis whereupon the determination of the presence or absence of cancer is made after the blood sample is taken and the biomarkers have been measured. Data collected in this manner can often overcome various limitations and biases inherent in retrospective studies which measure biomarkers in stored or archived samples already classified as being from cancer patients (“cases”) versus patients without apparent cancers (“controls”).
The data used to create the quantified risk levels preferably comes from very large numbers of patients, more than one thousand, more than ten thousand, or even more than one-hundred thousand patients. (Means for continuous improvements to the risk algorithms and tables using machine learning systems are described in the sections that follow.) The PPS is then converted to a multiplier indicating an increased likelihood of having the cancer by dividing the PPS by the reported incidence of cancer in the population or cohort of the population subject to stratification (e.g., human subjects 50 years or older). Each grouping or cohort grouping is given a risk categorization identifier, including, but not limited to, low risk, intermediate-low risk, intermediate risk, intermediate-high risk and highest risk. Thus, in one embodiment, each category of the risk categorization table comprises 1) an increased likelihood of having the cancer, 2) a risk identifier and 3) a range of composite scores.
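
By way of non-limiting illustration, the conversion of a PPS into a multiplier described above may be sketched as follows. The function name `likelihood_multiplier` and the numerical inputs are hypothetical and chosen only for illustration; the sketch assumes that the PPS and the reported incidence are both expressed as fractions between 0 and 1.

```python
def likelihood_multiplier(pps: float, cohort_incidence: float) -> float:
    """Divide the positive predictive score (PPS) of a stratified grouping
    by the reported incidence of cancer in the cohort.

    Both arguments are assumed to be fractions (e.g., 0.02 for 2%).
    """
    if not 0.0 <= pps <= 1.0:
        raise ValueError("pps must be a fraction between 0 and 1")
    if not 0.0 < cohort_incidence <= 1.0:
        raise ValueError("cohort_incidence must be a fraction in (0, 1]")
    return pps / cohort_incidence


# Hypothetical grouping: 30% observed incidence within the grouping,
# 2% reported incidence across the cohort.
high_group = likelihood_multiplier(0.30, 0.02)

# Hypothetical grouping whose observed incidence is below the cohort incidence.
low_group = likelihood_multiplier(0.01, 0.02)

print(f"high grouping multiplier: {high_group:.1f}")
print(f"low grouping multiplier: {low_group:.1f}")
```

A multiplier above 1 corresponds to a grouping with a higher likelihood of cancer than the cohort as a whole, and a multiplier below 1 corresponds to a lower likelihood.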

It is understood that the stratification of a population or of a cohort of a population of human subjects is based, at least in part, on 1) an identification of a certain cancer, 2) biomarkers that are associated with the cancer, 3) clinical parameter data, and in some cases, 4) publicly available data including risk factors for having the cancer. A cohort shares the same cancer risk factors as the asymptomatic individual. Validation of the biomarkers to be used in the present methods may be provided by analyzing retrospective cancer samples along with age matched normal (non-cancer) samples and/or other controls. As stated above, however, prospective validation is preferred.

The present invention further provides a machine learning system, methods and computer readable media for analyzing results from a panel of biomarkers for a cancer along with data from a patient's medical record, and other publicly available sources of information, and quantifying a human subject's increased risk (or in certain circumstances decreased risk) for the presence of the cancer in an asymptomatic human subject relative to a population. As used herein, the term “increased risk” refers to an increase in risk for the presence of the cancer as compared to the known prevalence of that particular cancer across the population cohort. The present methods are based on the generation of a risk categorization table for a certain cancer, wherein there is no intended limitation on when this table is generated. Thus, the present method and risk categorization table are based, at least in part, on 1) the identification and clustering of a set of proteins and/or resulting autoantibodies to those proteins that can serve as markers for the presence of a cancer, 2) normalization and aggregation of the markers measured to generate a biomarker composite score, 3) medical data for a patient and other publicly available sources of data for risk factors for having cancer, and 4) determination of threshold values used to divide patients into groups with varying degrees of risk for the presence of cancer in which the likelihood of an asymptomatic human subject having a quantified increased risk for the presence of the cancer is determined. A machine learning system may be utilized to determine the best cohort grouping as well as determine how biomarker composite data, medical data and other data are to be combined in order to generate a risk categorization in an optimal or near-optimal manner, e.g., correctly predicting which individuals have cancer with a low false positive rate.
The machine learning system yields a numerical risk score for each patient tested, which can be used by physicians to make treatment decisions concerning the therapy of cancer patients or, importantly, to further inform screening procedures to better predict and diagnose early stage cancer in asymptomatic patients. Also, as described in more detail herein, the machine learning system is adapted to receive additional data as the system is used in a real-world clinical setting and to recalculate and improve the risk categories and algorithm so that the system becomes “smarter” the more that it is used.

B. Definitions

As used herein, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more.”

As used herein, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated.

As used herein, the term “about” is used to refer to an amount that is approximately, nearly, almost, or in the vicinity of being equal to or is equal to a stated amount, e.g., the stated amount plus/minus about 5%, about 4%, about 3%, about 2% or about 1%.

As used herein, the term “asymptomatic” refers to a patient or human subject that has not previously been diagnosed with the same cancer for which their risk is now being quantified and categorized. For example, human subjects may show signs such as coughing, fatigue, pain, etc., but have not been previously diagnosed with lung cancer and are now undergoing screening to categorize their increased risk for the presence of cancer; for the present methods, such subjects are still considered “asymptomatic”.

As used herein, the term “AUC” refers to the Area Under the Curve, for example, of a ROC Curve. That value can assess the merit of a test on a given sample population, with a value of 1 representing a good test, ranging down to 0.5, which means the test is providing a random response in classifying test subjects. Since the range of the AUC is only 0.5 to 1.0, a small change in AUC has greater significance than a similar change in a metric that ranges from 0 to 1 or 0 to 100%. When the % change in the AUC is given, it will be calculated based on the fact that the full range of the metric is 0.5 to 1.0. A variety of statistics packages can calculate AUC for an ROC curve, such as JMP™ or Analyse-It™. AUC can be used to compare the accuracy of the classification algorithm across the complete data range. Classification algorithms with greater AUC have, by definition, a greater capacity to classify unknowns correctly between the two groups of interest (disease and no disease). The classification algorithm may be as simple as the measure of a single molecule or as complex as the measure and integration of multiple molecules.

As used herein, the terms “biological sample” and “test sample” refer to all biological fluids and excretions isolated from any given subject. In the context of embodiments of the present invention such samples include, but are not limited to, blood, blood serum, blood plasma, urine, tears, saliva, sweat, biopsy, ascites, cerebrospinal fluid, milk, lymph, bronchial and other lavage samples, or tissue extract samples. In certain embodiments, blood, serum, plasma and bronchial lavage or other liquid samples are convenient test samples for use in the context of the present methods.

As used herein, the terms “cancer” and “cancerous” refer to or describe the physiological condition in mammals that is typically characterized by unregulated cell growth. Examples of cancer include, but are not limited to, lung cancer, breast cancer, colon cancer, prostate cancer, hepatocellular cancer, gastric cancer, pancreatic cancer, cervical cancer, ovarian cancer, liver cancer, bladder cancer, cancer of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, and brain cancer.

As used herein, the term “cancer risk factors” refers to biological or environmental influences that are known risks associated with a particular cancer. These cancer risk factors include, but are not limited to, a family history of cancer (e.g., breast cancer), age, weight, sex, history of smoking tobacco, environmental factors (e.g., exposure to asbestos, exposure to radiation, etc.), occupational risk factors (e.g., coal miner, hazmat worker, etc.), genetic factors and mutations, and so forth. It is understood that these cancer risk factors, either individually or in combination, contribute to selecting a cohort of the population used to develop a Risk Categorization Table and that this same cohort is then tested using the present methods and machine learning system to determine their increased risk for the presence of cancer as compared to the known prevalence of cancer across the cohort. In certain embodiments, cancer risk factors for lung cancer are a human subject aged 50 years or older with a history of smoking tobacco.

As used herein, the term “cohort” or “cohort population” refers to a group or segment of human subjects with shared factors or influences, such as age, family history, cancer risk factors, environmental influences, medical histories, etc. In one instance, as used herein, a “cohort” refers to a group of human subjects with shared cancer risk factors; this is also referred to herein as a “disease cohort”. In another instance, as used herein, a “cohort” refers to a normal population group matched, for example by age, to the cancer risk cohort; also referred to herein as a “normal cohort”. A “same cohort” refers to a group of human subjects having the same shared cancer risk factors as the individual undergoing assessment for a risk of having a disease such as cancer.

As used herein, the term “normalized” refers to data that has been normalized by any normalization technique known in the art, including but not limited to MoM, standard deviation normalization, sigmoidal normalization, etc.

As used herein, the term “environmental database” refers to a database comprising environmental risk factors for cancer, including but not limited to location and zip code. For patients who have lived or worked at a particular location for a number of years, the environmental database may be able to indicate whether those locations are associated with the presence of cancer. Information from the database may be based on journal articles, scientific studies, etc.

As used herein, the term “employment database” or “occupational database” refers to a database comprising occupational risk factors for cancer. Such data includes, but is not limited to, occupations known to be associated with the development of cancer, chemicals or carcinogens that a person employed in a particular occupation is likely to encounter, and correlation between number of years in an occupation and risk (e.g., employment in an occupation for 5 years has a 5% increase in the risk of cancer, employment in the same occupation for 10 years has a 55% increase in the risk of cancer as compared to other occupations, etc.).

As used herein, the term “population database” refers to a database comprising demographics (e.g., gender, age, smoking history, family history, blood tests, biomarker tests, etc.) for a population of individuals. This data is supplied to a neural net for cohort analysis, and the neural net identifies the factors most predictive of the presence of cancer.

As used herein, the term “genetic database” refers to a database comprising information linking various types of genetic information to the presence of cancer (e.g., BRAF, V600E mutation, EGFP, gene SNPs, etc.).

As used herein, the term “raw images” refers to imaging studies prior to processing, e.g., X-rays, CT scans, MRI, EEG, ECG, ultrasound, etc.

As used herein, the term “medical history” refers to any type of medical information associated with a patient. In some embodiments, the medical history is stored in an electronic medical records database. Medical history may include clinical data (e.g., imaging modalities, blood work, biomarkers, cancerous samples and control samples, labs, etc.), clinical notes, symptoms, severity of symptoms, number of years smoking, family history of a disease, history of illness, treatment and outcomes, an ICD code indicating a particular diagnosis, history of other diseases, radiology reports, imaging studies, reports, medical histories, genetic risk factors identified from genetic testing, genetic mutations, etc.

As used herein, the term “converted numeric fields” refers to numeric data that has been extracted by natural language processing from unstructured data (e.g., years of smoking, frequency, etc.).

As used herein, the term “unstructured data” refers to text, free form text, etc. For example, unstructured data may include patient notes entered by a physician, annotations accompanying imaging studies, etc.

As used herein, the term “composite score” refers to an aggregation of the normalized values for the predetermined markers measured in the sample from the human subject and clinical parameter values. When used in the context of the risk categorization table and correlated to a stratified population grouping or cohort population grouping based on a range of composite scores in the Risk Categorization Table, the “composite score” is used, at least in part, by the machine learning system to determine the “risk score” for each human subject tested, wherein the numerical value (e.g., a multiplier, a percentage, etc.) indicating increased likelihood of having the cancer for the stratified grouping becomes the “risk score”. See FIG. 10.

As used herein, the term “master composite score” refers to a composite score generated by the master neural net system, which includes one or more of biomarker composite scores, medical history, publicly available sources of data related to cancer risk, etc., and is used to determine a risk category (e.g., low, medium, high, etc.) as well as to quantify risk for an individual.

In certain aspects the “cohort score” is also referred to herein as the “test score”.

As used herein, the terms “differentially expressed gene,” “differential gene expression” and their synonyms, which are used interchangeably, are used in the broadest sense and refer to a gene and/or resulting protein whose expression is activated to a higher or lower level in a subject suffering from a disease, specifically cancer, such as lung cancer, relative to its expression in a normal or control subject. The terms also include genes whose expression is activated to a higher or lower level at different stages of the same disease. It is also understood that a differentially expressed gene may be either activated or inhibited at the nucleic acid level or protein level, or may be subject to alternative splicing to result in a different polypeptide product. Such differences may be evidenced by a change in mRNA levels, surface expression, secretion or other partitioning of a polypeptide, for example. Differential gene expression may include a comparison of expression between two or more genes or their gene products (e.g., proteins), or a comparison of the ratios of the expression between two or more genes or their gene products, or even a comparison of two differently processed products of the same gene, which differ between normal subjects and subjects suffering from a disease, specifically cancer, or between various stages of the same disease. Differential expression includes both quantitative, as well as qualitative, differences in the temporal or cellular expression pattern in a gene or its expression products among, for example, normal and diseased cells, or among cells which have undergone different disease events or disease stages.

As used herein, the term “gene expression profiling” is used in the broadest sense, and includes methods of quantification of mRNA and/or protein levels in a biological sample.

As used herein, the term “large volume of patients” is used in the broadest sense, and includes a number of patients including, e.g., several hundred patients, a thousand patients, several thousand patients, ten thousand patients, several tens of thousands of patients, and so forth, with any amount in between. In some embodiments, the number of patients is a number sufficient to train the system.

As used herein, the term “increased risk” refers to an increase in the risk level, for a human subject after biomarker testing and/or data analysis by the machine learning system, for the presence of a cancer relative to a population's known prevalence of a particular cancer before testing. In other words, a human subject's risk for cancer before biomarker testing and/or data analysis may be 2% (based on the understood prevalence of cancer in the population), but after biomarker testing and/or data analysis (based on the measure of one or more of biomarker concentration, a patient's medical data, public sources of data, etc.) the patient's risk for the presence of cancer may be 30%, or alternatively reported as an increase of 15 times compared to the cohort. How the machine learning system calculates the 30% risk of having the cancer and the increased risk of 15 times relative to the population or cohort population is described in more detail herein. It is also contemplated, as will be apparent from the present risk categorization table and accompanying machine learning system, that the re-categorization of a patient's risk for the presence of a cancer may result in a risk that is less than the known prevalence of a particular cancer across a population or cohort population. For example, a human subject's risk for cancer before biomarker testing and/or data analysis may be 2% (based on the understood prevalence of cancer in the population), but after biomarker testing and/or data analysis (based on the measure of biomarkers and the patient's medical data and other data), their risk for the presence of cancer may be 1%, or alternatively reported as a multiplier of 0.5 times compared to the cohort population. In this instance, “increased risk” refers to a change in risk level relative to a population before testing.

As used herein, the term “decreased risk” refers to a decrease in the risk level, for a human subject after biomarker testing and/or data analysis, for the presence of a cancer relative to a population's known prevalence of a particular cancer before testing. In this instance, “decreased risk” refers to a change in risk level relative to a population before testing.

As used herein, the term “lung cancer” refers to a cancer state associated with the pulmonary system of any given subject. In the context of another embodiment of the present invention, lung cancers include, but are not limited to, adenocarcinoma, epidermoid carcinoma, squamous cell carcinoma, large cell carcinoma, small cell carcinoma, non-small cell carcinoma, and bronchioalveolar carcinoma. Within the context of another embodiment of the present invention, lung cancers may be at different stages, as well as varying degrees of grading. Methods for determining the stage of a lung cancer or its degree of grading are well known to those skilled in the art.

As used herein, the terms “marker”, “biomarker” (or fragment thereof) and their synonyms, which are used interchangeably, refer to molecules that can be evaluated in a sample and are associated with a physical condition. For example, markers include expressed genes or their products (e.g., proteins) or autoantibodies to those proteins that can be detected from human samples, such as blood, serum, solid tissue, and the like, and that are associated with a physical or disease condition. Such biomarkers include, but are not limited to, biomolecules comprising nucleotides, amino acids, sugars, fatty acids, steroids, metabolites, polypeptides, proteins (such as, but not limited to, antigens and antibodies), carbohydrates, lipids, hormones, antibodies, regions of interest which serve as surrogates for biological molecules, combinations thereof (e.g., glycoproteins, ribonucleoproteins, lipoproteins) and any complexes involving any such biomolecules, such as, but not limited to, a complex formed between an antigen and an autoantibody that binds to an available epitope on said antigen. The term “biomarker” can also refer to a portion of a polypeptide (parent) sequence that comprises at least 5 consecutive amino acid residues, preferably at least 10 consecutive amino acid residues, more preferably at least 15 consecutive amino acid residues, and retains a biological activity and/or some functional characteristics of the parent polypeptide, e.g., antigenicity or structural domain characteristics. The present markers refer to both tumor antigens present on or in cancerous cells and those that have been shed from the cancerous cells into bodily fluids such as blood or serum. The present markers, as used herein, also refer to autoantibodies produced by the body to those tumor antigens. In one aspect, a “marker” as used herein refers to both tumor antigens and autoantibodies that are capable of being detected in serum of a human subject.
It is also understood in the present methods that the markers in a panel may each contribute equally to the composite score, or certain biomarkers may be weighted such that the markers in a panel contribute a different weight or amount to the final composite score. A biomarker may include any biological substance indicative of the presence of cancer, including but not limited to genetic, epigenetic, proteomic, glycomic or imaging biomarkers. Biomarkers include molecules secreted by tumors or cancer, including gene, gene expression, and protein-based products (tumor markers or antigens, cell free DNA, mRNA, etc.).

As used herein, the term “multiplier indicating an increased likelihood of having the cancer” refers to a numerical value of the risk categorization table that is assigned to a patient sample after quantifying that patient's increased risk, relative to the cohort population, for the presence of cancer. When used in the context of the risk categorization table when testing a human subject and correlated to a range of composite scores, the “multiplier indicating increased likelihood of having the cancer” becomes the “risk score” for each human subject tested. See FIG. 10.

As used herein, the term “normalization” and its derivatives, when used in conjunction with measurement of biomarkers across samples and time, refer to mathematical methods where the intention is that these normalized values allow the comparison of corresponding normalized values from different datasets in a way that eliminates or minimizes differences and gross influences between the datasets. In one embodiment, multiple of median is used as the normalization methodology for the present methods.
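
By way of non-limiting illustration, a multiple of median (MoM) normalization may be sketched as follows. The sketch assumes that the median is taken over a reference set of measurements for the same marker (for example, a control population); the choice of reference set, the function name `multiple_of_median` and the sample values are assumptions made only for illustration.

```python
from statistics import median
from typing import Sequence


def multiple_of_median(value: float, reference_values: Sequence[float]) -> float:
    """Express a measured marker value as a multiple of the median of a
    reference set of measurements of the same marker."""
    if not reference_values:
        raise ValueError("reference_values must not be empty")
    reference_median = median(reference_values)
    if reference_median <= 0:
        raise ValueError("reference median must be positive")
    return value / reference_median


# Hypothetical reference measurements for one marker (arbitrary units).
reference = [1.0, 2.0, 3.0]
print(multiple_of_median(6.0, reference))
```

Because each marker is divided by its own reference median, markers measured on different scales become comparable before aggregation.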

As used herein, the terms “panel of markers”, “panel of biomarkers” and their synonyms, which are used interchangeably, refer to more than one marker that can be detected from a human sample and that together are associated with the presence of a particular cancer. In an embodiment of the present application, the biomarkers are not individually quantified as an absolute value to indicate the presence of a cancer, but the measured values are normalized and the normalized values are aggregated (e.g., summed or weighted and summed, etc.) for inclusion within a biomarker composite score. As disclosed above, each marker in the panel may be given a weight of 1, or some other value that is either a fraction of 1 or a multiple of 1, depending on the contribution of the marker to the cancer being screened and the overall composition of the panel.
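
By way of non-limiting illustration, the aggregation of normalized marker values into a biomarker composite score (summed, or weighted and summed) may be sketched as follows. The function name `biomarker_composite_score`, the marker values and the weights are hypothetical and are not taken from any example herein.

```python
from typing import Mapping, Optional


def biomarker_composite_score(
    normalized_values: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Sum normalized marker values, applying a per-marker weight.

    When no weights are supplied, every marker is given a weight of 1.
    """
    if weights is None:
        weights = {name: 1.0 for name in normalized_values}
    missing = set(normalized_values) - set(weights)
    if missing:
        raise KeyError(f"no weight supplied for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in normalized_values.items())


# Hypothetical normalized (e.g., MoM) values for a three-marker panel.
panel = {"CEA": 2.0, "CA-125": 1.5, "CYFRA": 3.0}

print(biomarker_composite_score(panel))
# Hypothetical weighting in which one marker counts twice as much.
print(biomarker_composite_score(panel, {"CEA": 2.0, "CA-125": 1.0, "CYFRA": 1.0}))
```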

As used herein, the term “pathology” of (tumor) cancer includes all phenomena that compromise the well-being of the patient. This includes, without limitation, abnormal or uncontrollable cell growth, metastasis, interference with the normal functioning of neighboring cells, release of cytokines or other secretory products at abnormal levels, suppression or aggravation of inflammatory or immunological response, neoplasia, premalignancy, malignancy, invasion of surrounding or distant tissues or organs, such as lymph nodes, etc.

As used herein, the term “known prevalence of cancer” refers to a prevalence of a cancer in a population before the human subject is tested and undergoes data analysis using the present methods. This known prevalence of cancer can be a prevalence reported in the literature based on retrospective data or be determined by a machine learning system that takes into account factors such as age and more immediate and relevant history or a combination thereof. In this instance, a known prevalence of cancer in a cohort refers to a risk of having cancer prior to testing and analysis by the present methods and systems.

As used herein, the terms “a positive predictive score,” “a positive predictive value,” or “PPV” refer to the likelihood that a score within a certain range on a biomarker test is a true positive result. It is defined as the number of true positive results divided by the number of total positive results. True positive results can be calculated by multiplying the test sensitivity times the prevalence of disease in the test population. False positives can be calculated by multiplying (1 minus the specificity) times (1 minus the prevalence of disease in the test population). Total positive results equal true positives plus false positives.
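
By way of non-limiting illustration, the calculation described above may be sketched as follows. The function name `positive_predictive_value` and the example sensitivity, specificity and prevalence are hypothetical values chosen only to show the arithmetic; all three inputs are assumed to be fractions between 0 and 1.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Return true positives divided by total positives, with both terms
    expressed as fractions of the test population."""
    for name, value in (
        ("sensitivity", sensitivity),
        ("specificity", specificity),
        ("prevalence", prevalence),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    total_positives = true_positives + false_positives
    if total_positives == 0.0:
        raise ValueError("no positive results are expected for these inputs")
    return true_positives / total_positives


# Hypothetical test: 80% sensitivity, 90% specificity, 2% prevalence.
ppv = positive_predictive_value(0.80, 0.90, 0.02)
print(f"PPV = {ppv:.3f}")
```

At low prevalence, false positives dominate the total positives, so even a test with high specificity can have a modest PPV.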

As used herein, the term “risk score” refers to a single numerical value that indicates an asymptomatic human subject's increased (or decreased) risk for the presence of a cancer as compared to the known prevalence of cancer in the disease cohort. In certain embodiments of the present methods, the composite score is calculated for a human subject and correlated to a multiplier indicating an increased likelihood of having the cancer, wherein the composite score is correlated based on the range of composite scores for each stratified grouping or cohort population grouping in the risk categorization table. In this way the composite score is converted to a risk score based on the multiplier indicating increased likelihood of having the cancer for the grouping that is the best match for the composite score. See FIG. 10.
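
By way of non-limiting illustration, the conversion of a composite score to a risk score by lookup in a risk categorization table may be sketched as follows. The table rows below (score ranges, identifiers and multipliers) are hypothetical placeholders and are not the values of FIG. 10; the names `RiskCategory`, `HYPOTHETICAL_TABLE` and `risk_score` are likewise assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class RiskCategory(NamedTuple):
    upper_bound: float  # highest composite score included in the category
    identifier: str     # risk categorization identifier
    multiplier: float   # increased likelihood relative to the cohort


# Hypothetical table; real ranges and multipliers would be derived from data.
HYPOTHETICAL_TABLE = (
    RiskCategory(9.0, "low risk", 1.0),
    RiskCategory(14.0, "intermediate risk", 3.0),
    RiskCategory(float("inf"), "highest risk", 10.0),
)


def risk_score(composite_score: float, table: Sequence[RiskCategory]) -> RiskCategory:
    """Return the first category whose range contains the composite score."""
    for category in sorted(table, key=lambda c: c.upper_bound):
        if composite_score <= category.upper_bound:
            return category
    raise ValueError("composite score does not fall within any category")


match = risk_score(9.1, HYPOTHETICAL_TABLE)
print(match.identifier, match.multiplier)
```

Each row carries the three elements of a category described herein: a range of composite scores, a risk identifier and a multiplier.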

As used herein, the term “Receiver Operating Characteristic Curve,” or “ROC curve,” refers to a plot of the performance of a particular feature for distinguishing two populations, e.g., patients with lung cancer (cases) and controls (those without lung cancer). Data across the entire population (namely, the cases and controls) are sorted in ascending order based on the value of a single feature. Then, for each value for that feature, the true positive and false positive rates for the data are determined. The true positive rate is determined by counting the number of cases above the value for that feature under consideration and then dividing by the total number of cases. The false positive rate is determined by counting the number of controls above the value for that feature under consideration and then dividing by the total number of controls.

ROC curves can be generated for a single feature as well as for other single outputs, for example, a combination of two or more features that are combined (such as, added, subtracted, multiplied, weighted, etc.) to provide a single combined value which can be plotted in a ROC curve.

The ROC curve is a plot of the true positive rate (sensitivity) of a test against the false positive rate (1 − specificity) of the test. ROC curves provide another means to quickly screen a data set.
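The ROC construction described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the disclosure: the function names are hypothetical, “above” is read as strictly greater than the threshold, a threshold of negative infinity is added so the curve reaches the point (1, 1), and the area under the curve (AUC) is estimated with the trapezoidal rule.

```python
def roc_points(case_values, control_values):
    """Return a list of (false positive rate, true positive rate) points.

    Each observed feature value is used as a threshold; a subject counts as
    positive when its value is strictly above the threshold (an assumption).
    """
    thresholds = [float("-inf")] + sorted(set(case_values) | set(control_values))
    points = []
    for t in thresholds:
        tpr = sum(v > t for v in case_values) / len(case_values)
        fpr = sum(v > t for v in control_values) / len(control_values)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal estimate of the area under the ROC points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical feature values for three patients and three controls.
example_auc = area_under_curve(roc_points([3, 4, 5], [1, 2, 3]))
```

The same functions apply unchanged to a combined value computed from several features, since the ROC construction only requires one number per subject.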

As used herein, the term “screening” refers to a strategy used in a population to identify an unrecognized cancer in asymptomatic subjects, for example those without signs or symptoms of the cancer. As used herein, a cohort of the population (e.g., smokers aged 50 or older) is screened for a particular cancer (e.g., lung cancer), wherein the present method and system are applied to determine the quantified increased risk to those asymptomatic subjects for the presence of the cancer.

As used herein, the term “subject” refers to an animal, preferably a mammal, including a human or non-human. The terms “patient” and “human subject” may be used interchangeably herein.

As used herein, the term “clinical data” includes symptoms, differential diagnosis, active diseases, current medications, allergies, past disease history, and family disease history.

As used herein, the term “tumor” refers to all neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues.

As used herein, the phrase “Weighted Scoring Method” refers to a method that involves converting the measurement of one biomarker that is identified and quantified in a test sample into one of many potential scores. A ROC curve can be used to standardize the scoring between different markers by enabling the use of a weighted score based on the inverse of the false positive % defined from the ROC curve. The weighted score can be calculated by multiplying the AUC by a factor for a marker and then dividing by the false positive % based on a ROC curve. The weighted score can be calculated using the formula:

Weighted Score=(AUC_(x)×factor)/(1−% specificity_(x))

wherein x is the marker; the “factor” is a real number or integer (such as 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 and so on) applied throughout the panel; and the “specificity” is a chosen value that does not exceed 95%. Multiplying by a common factor across the panel allows the user to scale the weighted score. Hence, the measurement of one marker can be converted into as many or as few scores as desired.

The weighting provides higher scores for biomarkers with a low false positive rate (and therefore a higher specificity) for the population of interest. The weighting paradigm can comprise selecting levels of false positivity (1 − specificity) below which the test will result in an increased score. Thus, markers with high specificity can be given a greater score or a greater range of scores than markers that are less specific.
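The weighted score formula above may be illustrated with the following Python sketch. The function name and example values are hypothetical; specificity is assumed to be expressed as a fraction, and the check against 0.95 reflects the statement that the chosen specificity does not exceed 95%.

```python
def weighted_score(auc, factor, specificity):
    """Weighted Score = (AUC x factor) / (1 - specificity).

    `specificity` is a fraction in [0, 0.95]; values above 0.95 are rejected
    because the chosen specificity is stated not to exceed 95%.
    """
    if not 0.0 <= specificity <= 0.95:
        raise ValueError("specificity must be between 0 and 0.95")
    return auc * factor / (1.0 - specificity)


# Hypothetical marker with AUC 0.80 scored at two specificity cutoffs.
score_at_80 = weighted_score(0.80, 1, 0.80)   # about 4
score_at_90 = weighted_score(0.80, 1, 0.90)   # about 8
```

For the same AUC and factor, the score at the higher specificity cutoff is larger, which is the behavior the weighting paradigm is intended to produce.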

The foundation for assessing the parameters for weighting can be obtained by determining the presence of a marker in a population of patients with lung cancer and in normal individuals. The information (data) obtained from all the samples is used to generate a ROC curve and to calculate an AUC for each biomarker. A number of predetermined cutoffs and a weighted score are assigned to each biomarker based on the % specificity. That calculus provides a stratification of aggregate marker scores, and those marker scores can be used to define ranges that correlate to arbitrary risk categories of whether one has a higher or lower risk of having lung cancer. The number of categories can be a design choice or may be driven by the data. For example, a machine learning system may determine parameters for weighting markers, for thresholds, as well as for creating cohort populations.

C. Methods for Determining a Likelihood for the Presence of a Cancer in an Asymptomatic (or Vaguely Symptomatic) Human Using Machine Learning Classifiers

In certain embodiments, provided herein is a computer implemented method for assessing the likelihood that a patient has cancer relative to a population. The asymptomatic patients that, after testing, have an increased likelihood for the presence of cancer relative to the population are those that a physician may select for follow-up diagnostic testing such as CT screening or analysis of a biopsy sample. Therefore, in certain embodiments, the method for assessing the likelihood that a patient has cancer relative to a population comprises 1) measuring the values of a panel of biomarkers in a sample from a patient; 2) obtaining clinical parameters from the patient; 3) utilizing a classifier generated by a machine learning system to classify the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer, wherein the classifier comprises a sensitivity of at least 70% and a specificity of at least 80%, and wherein the classifier is generated using a panel of biomarkers comprising at least two different biomarkers, and at least one clinical parameter; and, 4) providing a notification to a user for diagnostic testing when a patient is classified into a category indicating a likelihood of having cancer. The generation of the classifier used in the method herein is disclosed in detail below. Example 3 provides illustrative embodiments of a trained ANN for use in classifying patients into a category indicative of a likelihood of having lung cancer or into another category indicative of a likelihood of not having lung cancer.
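The performance criterion recited above (a sensitivity of at least 70% and a specificity of at least 80%) may be checked against labeled retrospective data as in the following Python sketch. The function names and example labels are hypothetical; labels are assumed to be 1 for cancer and 0 for no cancer, and the classifier itself is outside the scope of the sketch.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = cancer, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


def meets_targets(y_true, y_pred, min_sensitivity=0.70, min_specificity=0.80):
    """True when both performance thresholds are met."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return sens >= min_sensitivity and spec >= min_specificity


# Hypothetical known outcomes and classifier outputs for nine patients.
known = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1]
acceptable = meets_targets(known, predicted)
```

In this hypothetical data set the sensitivity is 3/4 and the specificity is 4/5, so both thresholds are satisfied.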

In certain embodiments, provided is a method of determining a quantified increased risk for the presence of a disease such as cancer in an asymptomatic human subject, which may comprise: 1) measuring a concentration or an amount of each marker of a panel of markers in a sample from the human subject; 2) determining a normalized value of each marker in a sample from a human subject; 3) aggregating (e.g., summing, weighting, etc.) each normalized value to obtain a biomarker composite score for the human subject; 4) determining a biomarker velocity for one or more biomarkers; 5) obtaining data pertaining to a patient's medical records related to determining a risk for having cancer; 6) obtaining publicly available information (e.g., environmental data, occupational data, genetic data, etc.) pertaining to an increased risk of cancer; 7) generating a master composite score for the human subject based on data from items 1-6 using a machine learning system; 8) quantifying the increased risk for the presence of cancer for the human subject as a risk score, by matching the master composite score to a risk category of a stratified cohort population or population, wherein each risk category comprises a numeric value indicating an increased likelihood of having the disease, e.g., cancer, correlated to a range of master composite scores, and wherein the risk categories, cohort population, and weighting of risk factors are determined by a machine learning system; and 9) providing a risk score for the human subject, whereby the quantified increased relative risk for the presence of a cancer in an asymptomatic patient relative to a population or cohort population has been determined.

One or more steps of the techniques presented herein can be performed in an automated or partially automated manner by a machine learning system, as described herein. When the method is performed via a machine learning system, performance of the method further requires the use of appropriate hardware, such as input, memory, processing, display and output devices, etc., and software.

i) Measuring Markers in a Sample

As part of the present method, a panel of markers from an asymptomatic human subject may be measured. There are many methods known in the art for measuring either gene expression (e.g., mRNA) or the resulting gene products (e.g., polypeptides or proteins) that can be used in the present methods. However, for at least 2-3 decades tumor antigens (e.g., CEA, CA-125, PSA, etc.) have been the most widely utilized biomarkers for cancer detection throughout the world and are the preferred tumor marker type for the present invention.

For tumor antigen detection, testing is preferably conducted using an automated immunoassay analyzer from a company with a large installed base. Representative analyzers include the Elecsys® system from Roche Diagnostics or the Architect® Analyzer from Abbott Diagnostics. Using such standardized platforms permits the results from one laboratory or hospital to be transferable to other laboratories around the world. However, the methods provided herein are not limited to any one assay format or to any particular set of markers that comprise a panel. For example, PCT International Pat. Pub. No. WO 2009/006323; US Pub. No. 2012/0071334; US Pat. Pub. No. 2008/0160546; US Pat. Pub. No. 2008/0133141; and US Pat. Pub. No. 2007/0178504 (each herein incorporated by reference) teach a multiplex lung cancer assay using beads as the solid phase and fluorescence or color as the reporter in an immunoassay format. Hence, the degree of fluorescence or color can be provided in the form of a qualitative score as compared to an actual quantitative value of reporter presence and amount.

For example, the presence and quantification of one or more antigens or antibodies in a test sample can be determined using one or more immunoassays that are known in the art. Immunoassays typically comprise: (a) providing an antibody (or antigen) that specifically binds to the biomarker (namely, an antigen or an antibody); (b) contacting a test sample with the antibody or antigen; and (c) detecting the presence of a complex of the antibody bound to the antigen in the test sample or a complex of the antigen bound to the antibody in the test sample.

Well known immunological binding assays include, for example, an enzyme linked immunosorbent assay (ELISA), which is also known as a “sandwich assay”, an enzyme immunoassay (EIA), a radioimmunoassay (RIA), a fluoroimmunoassay (FIA), a chemiluminescent immunoassay (CLIA), a counting immunoassay (CIA), a filter media enzyme immunoassay (MEIA), a fluorescence-linked immunosorbent assay (FLISA), agglutination immunoassays and multiplex fluorescent immunoassays (such as the Luminex LabMAP), immunohistochemistry, etc. For a review of the general immunoassays, see also, Methods in Cell Biology: Antibodies in Cell Biology, volume 37 (Asai, ed. 1993); Basic and Clinical Immunology (Daniel P. Stites; 1991).

The immunoassay can be used to determine a test amount of an antigen in a sample from a subject. First, a test amount of an antigen in a sample can be detected using the immunoassay methods described above. If an antigen is present in the sample, it will form an antibody-antigen complex with an antibody that specifically binds the antigen under suitable incubation conditions as described herein. The amount, activity, or concentration, etc. of an antibody-antigen complex can be determined by comparing the measured value to a standard or control. The AUC for the antigen can then be calculated using known techniques, such as, but not limited to, a ROC analysis.

In another embodiment, gene expression of markers (e.g., mRNA) is measured in a sample from a human subject. For example, gene expression profiling methods for use with paraffin-embedded tissue include quantitative reverse transcriptase polymerase chain reaction (qRT-PCR); however, other technology platforms, including mass spectroscopy and DNA microarrays, can also be used. These methods include, but are not limited to, PCR, Microarrays, Serial Analysis of Gene Expression (SAGE), and Gene Expression Analysis by Massively Parallel Signature Sequencing (MPSS).

Any methodology that provides for the measurement of a marker or panel of markers from a human subject is contemplated for use with the present methods. In certain embodiments, the sample from the human subject is a tissue section such as from a biopsy. In another embodiment, the sample from the human subject is a bodily fluid such as blood, serum, plasma or a part or fraction thereof. In other embodiments, the sample is blood or serum and the markers are proteins measured therefrom. In yet another embodiment, the sample is a tissue section and the markers are mRNA expressed therein. Many other combinations of sample forms from the human subjects and the form of the markers are contemplated.

ii) Biomarkers

Before measurement can be performed, a panel of markers needs to be selected for the particular cancer being screened. Many markers are known for diseases, including cancers, and a known panel can be selected, or, as was done by the present Applicants, a panel can be selected based on measurement of individual markers in retrospective clinical samples, wherein a panel is generated based on empirical data for a desired disease such as cancer, and preferably lung cancer. See, for example, US Publication No. 2013/0196868, the contents of which are herein incorporated by reference.

Examples of biomarkers that can be employed include molecules detectable, for example, in a body fluid sample, such as antibodies, antigens, small molecules, proteins, hormones, enzymes, genes and so on. However, the use of tumor antigens has many advantages due to their widespread use over many years and the fact that validated and standardized detection kits are available for many of them for use with the aforementioned automated immunoassay platforms.

In a particular embodiment, a panel of markers is selected based on their association with lung cancer. The tumor antigens used in the study reported by Molina, et al., Am J Respir Crit Care Med. published online 14 Oct. 2015 “Assessment of a Combined Panel of Six Serum Tumor Markers for Lung Cancer”, namely, CEA, CA15.3, SCC, CYFRA 21-1, NSE and ProGRP, are representative of those that may be used with the present invention.

In embodiments, a panel of biomarkers in combination with clinical parameters is selected from: 1) CA-125, CEA, CYFRA, NYESO, Age, Smoking Status, Pack Years, COPD; and 2) CEA, CYFRA, NSE, Smoking Status, Age, Nodule Size. In other embodiments, a panel of biomarkers is selected from CA 19-9, CEA, CYFRA, NSE, Pro-GRP, SCC, CA 125, CA 15-3, and CA 72.

Alternatively, the panel of markers is selected from anti-p53, anti-NY-ESO-1, anti-ras, anti-Neu, anti-MAPKAPK3, cytokeratin 8, cytokeratin 19, cytokeratin 18, CEA, CA125, CA15-3, CA19-9, Cyfra 21-1, serum amyloid A, proGRP and α₁-anti-trypsin (US 20120071334; US 20080160546; US 20080133141; US 20070178504 (each herein incorporated by reference)). Many circulating proteins have more recently been identified as possible biomarkers for the occurrence of lung cancer, for example the proteins CEA, RBP4, hAAT, SCCA [Patz, E. F., et al., Panel of Serum Biomarkers for the Diagnosis of Lung Cancer. Journal of Clinical Oncology, 2007. 25(35): p. 5578-5583.]; the proteins IL6, IL-8 and CRP [Pine, S. R., et al., Increased Levels of Circulating Interleukin 6, Interleukin 8, C-Reactive Protein, and Risk of Lung Cancer. Journal of the National Cancer Institute, 2011. 103(14): p. 1112-1122.]; the proteins TNF-α, CYFRA 21-1, IL-1ra, MMP-2, monocyte chemotactic protein-1 & sE-selectin [Farlow, E. C., et al., Development of a Multiplexed Tumor-Associated Autoantibody-Based Blood Test for the Detection of Non-Small Cell Lung Cancer. Clinical Cancer Research, 2010. 16(13): p. 3452-3462.]; the proteins prolactin, transthyretin, thrombospondin-1, E-selectin, C-C motif chemokine 5, macrophage migration inhibitory factor, plasminogen activator inhibitor, receptor tyrosine-protein kinase, erbb-2, cytokeratin fragment 21.1, and serum amyloid A [Bigbee, W. L. P., et al., A Multiplexed Serum Biomarker Immunoassay Panel Discriminates Clinical Lung Cancer Patients from High-Risk Individuals Found to be Cancer-Free by CT Screening. Journal of Thoracic Oncology, April 2012. 7(4): p. 698-708.]; the proteins EGF, sCD40 ligand, IL-8, MMP-8 [Izbicka, E., et al., Plasma Biomarkers Distinguish Non-small Cell Lung Cancer from Asthma and Differ in Men and Women. Cancer Genomics & Proteomics, 2012. 9(1): p. 27-35.].

Additional tumor markers include human epididymal protein 4 [Roche Diagnostics (2015)]; calcitonin, PAP, BR 27.29, Her-2 [Siemens (2015)]; and HE-4 [Abbott (2015) and Fujirebio (2015)]. Novel ligands that bind to circulating, lung-cancer associated proteins which are possible biomarkers include nucleic acid aptamers that bind cadherin-1, CD30 ligand, endostatin, HSP90a, LRIG3, MIP-4, pleiotrophin, PRKCI, RGM-C, SCF-sR, sL-selectin, and YES [Ostroff, R. M., et al., Unlocking Biomarker Discovery: Large Scale Application of Aptamer Proteomic Technology for Early Detection of Lung Cancer. PLoS ONE, 2010. 5(12): p. e15003.]; monoclonal antibodies that bind leucine-rich alpha-2 glycoprotein 1 (LRG1), alpha-1 antichymotrypsin (ACT), complement C9, haptoglobin beta chain [Guergova-Kuras, M., et al., Discovery of Lung Cancer Biomarkers by Profiling the Plasma Proteome with Monoclonal Antibody Libraries. Molecular & Cellular Proteomics, 2011. 10(12).]; and the protein Ciz1 [Higgins, G., et al., Variant Ciz1 is a circulating biomarker for early-stage lung cancer. Proceedings of the National Academy of Sciences, 2012.].

Autoantibodies that are proposed to be circulating markers for lung cancer include p53, NY-ESO-1, CAGE, GBU4-5, Annexin 1, and SOX2 [Lam, S., et al., EarlyCDT-Lung: An Immunobiomarker Test as an Aid to Early Detection of Lung Cancer. Cancer Prevention Research, 2011. 4(7): p. 1126-1134.] and IMPDH, phosphoglycerate mutase, ubiquilin, Annexin I, Annexin II, and heat shock protein 70-9B (HSP70-9B) [Farlow, E. C., et al., Development of a Multiplexed Tumor-Associated Autoantibody-Based Blood Test for the Detection of Non-Small Cell Lung Cancer. Clinical Cancer Research, 2010. 16(13): p. 3452-3462.].

Micro-RNAs that are proposed to be circulating markers for lung cancer include miR-21, miR-126, miR-210, miR-486-5p [Shen, J., et al., Plasma microRNAs as potential biomarkers for non-small-cell lung cancer. Lab Invest, 2011. 91(4): p. 579-587.]; miR-15a, miR-15b, miR-27b, miR-142-3p, miR-301 [Hennessey, P. T., et al., Serum microRNA Biomarkers for Detection of Non-Small Cell Lung Cancer. PLoS ONE, 2012. 7(2): p. e32307.]; let-7b, let-7c, let-7d, let-7e, miR-10a, miR-10b, miR-130b, miR-132, miR-133b, miR-139, miR-143, miR-152, miR-155, miR-15b, miR-17-5p, miR-193, miR-194, miR-195, miR-196b, miR-199a*, miR-19b, miR-202, miR-204, miR-205, miR-206, miR-20b, miR-21, miR-210, miR-214, miR-221, miR-27a, miR-27b, miR-296, miR-29a, miR-301, miR-324-3p, miR-324-5p, miR-339, miR-346, miR-365, miR-378, miR-422a, miR-432, miR-485-3p, miR-496, miR-497, miR-505, miR-518b, miR-525, miR-566, miR-605, miR-638, miR-660, and miR-93 [United States Patent Application 20110053158]; hsa-miR-361-5p, hsa-miR-23b, hsa-miR-126, hsa-miR-527, hsa-miR-29a, hsa-let-7i, hsa-miR-19a, hsa-miR-28-5p, hsa-miR-185*, hsa-miR-23a, hsa-miR-1914*, hsa-miR-29c, hsa-miR-505*, hsa-let-7d, hsa-miR-378, hsa-miR-29b, hsa-miR-604, hsa-miR-29b, hsa-let-7b, hsa-miR-299-3p, hsa-miR-423-3p, hsa-miR-18a*, hsa-miR-1909, hsa-let-7c, hsa-miR-15a, hsa-miR-425, hsa-miR-93*, hsa-miR-665, hsa-miR-30e, hsa-miR-339-3p, hsa-miR-1307, hsa-miR-625*, hsa-miR-193a-5p, hsa-miR-130b, hsa-miR-17*, hsa-miR-574-5p and hsa-miR-324-3p [United States Patent Application 20120108462]; miR-20a, miR-24, miR-25, miR-145, miR-152, miR-199a-5p, miR-221, miR-222, miR-223, miR-320 [Chen, X., et al., Identification of ten serum microRNAs from a genome-wide serum microRNA expression profile as novel noninvasive biomarkers for nonsmall cell lung cancer diagnosis. International Journal of Cancer, 2012. 130(7): p. 1620-1628.]; hsa-let-7a, hsa-let-7b, hsa-let-7d, hsa-miR-103, hsa-miR-126, hsa-miR-133b, hsa-miR-139-5p, hsa-miR-140-5p, hsa-miR-142-3p, hsa-miR-142-5p, hsa-miR-148a, hsa-miR-148b, hsa-miR-17, hsa-miR-191, hsa-miR-22, hsa-miR-223, hsa-miR-26a, hsa-miR-26b, hsa-miR-28-5p, hsa-miR-29a, hsa-miR-30b, hsa-miR-30c, hsa-miR-32, hsa-miR-328, hsa-miR-331-3p, hsa-miR-342-3p, hsa-miR-374a, hsa-miR-376a, hsa-miR-432-staR, hsa-miR-484, hsa-miR-486-5p, hsa-miR-566, hsa-miR-92a, hsa-miR-98 [Bianchi, F., et al., A serum circulating miRNA diagnostic test to identify asymptomatic high-risk individuals with early stage lung cancer. EMBO Molecular Medicine, 2011. 3(8): p. 495-503.]; and miR-190b, miR-630, miR-942, and miR-1284 [Patnaik, S. K., et al., MicroRNA Expression Profiles of Whole Blood in Lung Adenocarcinoma. PLoS ONE, 2012. 7(9): p. e46045.].

In one embodiment, a panel of markers for lung cancer is selected from CEA (GenBank Accession CAE75559), CA125 (UniProtKB/Swiss-Prot: Q8WXI7.2), Cyfra 21-1 (NCBI Reference Sequence: NP_008850.1), anti-NY-ESO-1 (antigen NCBI Reference Sequence: NP_001318.1), anti-p53 (antigen GenBank: BAC16799.1) and anti-MAPKAPK3 (antigen NCBI Reference Sequence: NP_001230855.1); the first three are tumor marker proteins and the last three are autoantibodies.

In certain embodiments, a panel of markers comprises circulating markers associated with colorectal cancer (CRC); those include the microRNA miR-92 [Ng, E. K. O., et al., Differential expression of microRNAs in plasma of patients with colorectal cancer: a potential marker for colorectal cancer screening. Gut, 2009. 58(10): p. 1375-1381.] and aberrantly methylated SEPT9 DNA [deVos, T., et al., Circulating Methylated SEPT9 DNA in Plasma Is a Biomarker for Colorectal Cancer. Clinical Chemistry, 2009. 55(7): p. 1337-1346.].

In certain embodiments, a panel of markers comprises markers associated with a cancer selected from bile duct cancer, bone cancer, pancreatic cancer, cervical cancer, colon cancer, colorectal cancer, gallbladder cancer, liver or hepatocellular cancer, ovarian cancer, testicular cancer, lobular carcinoma, prostate cancer, and skin cancer or melanoma. In other embodiments, a panel of markers comprises markers associated with breast cancer.

A panel can comprise any number of markers as a design choice, seeking, for example, to maximize specificity or sensitivity of the assay. Hence, an assay of interest may ask for the presence of at least one of two or more biomarkers, three or more biomarkers, four or more biomarkers, five or more biomarkers, six or more biomarkers, seven or more biomarkers, or eight or more biomarkers as a design choice.

Thus, in one embodiment, the panel of biomarkers may comprise at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine or at least ten or more different markers. In one embodiment, the panel of biomarkers comprises about two to ten different markers. In another embodiment, the panel of biomarkers comprises about four to eight different markers. In yet another embodiment, the panel of markers comprises about six different markers.

Generally, a sample is committed to the assay and the results can be a range of numbers reflecting the presence and level (e.g., concentration, amount, activity, etc.) of each of the biomarkers of the panel in the sample.

The choice of the markers may be based on the understanding that each marker, when measured and normalized, contributes equally to determining the likelihood of the presence of the cancer. Thus, in certain embodiments, each marker in the panel is measured and normalized, wherein none of the markers is given any specific weight. In this instance each marker has a weight of 1.

In other embodiments, the choice of the markers may be based on the understanding that each marker, when measured and normalized, contributes unequally to determining the likelihood of the presence of the cancer. In this instance, a particular marker in the panel can be weighted as a fraction of 1 (for example if the relative contribution is low), as a multiple of 1 (for example if the relative contribution is high) or as 1 (for example when the relative contribution is neutral compared to the other markers in the panel). Thus, in certain embodiments, the present methods further comprise weighting the normalized values prior to aggregation (e.g., summation, weighting and summation, etc.) of the normalized values to obtain a composite score.
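The equal and unequal weighting schemes described above may be illustrated with the following Python sketch, in which the composite score is a weighted sum of normalized marker values. The function name, marker values, and weights are hypothetical; a marker without an explicit weight is assumed to have a weight of 1.

```python
def biomarker_composite_score(normalized, weights=None):
    """Weighted sum of normalized marker values.

    `normalized` maps marker name -> normalized value.
    `weights` maps marker name -> weight; missing markers default to 1.0.
    """
    weights = weights or {}
    return sum(weights.get(marker, 1.0) * value
               for marker, value in normalized.items())


# Hypothetical normalized values for a two-marker panel.
values = {"CEA": 1.2, "CYFRA": 0.5}
equal_weight_score = biomarker_composite_score(values)               # every weight is 1
unequal_weight_score = biomarker_composite_score(values, {"CEA": 2.0})
```

Leaving the weights unspecified reproduces the equal contribution case, while a weight below or above 1 lowers or raises a marker's contribution relative to the rest of the panel.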

In still other embodiments, a neural net system may analyze values from biomarker panels without normalization of the values. Thus, the raw value obtained from the instrumentation used to make the measurement may be analyzed directly.

The collection of markers in a multiplex assay may comprise varying levels of value or predictability in diagnosing disease. Hence, the impact of any one marker on the ultimate determination may be weighted based on the aggregated data obtained in screening populations and correlated with actual pathology to provide a more discriminating or effective diagnostic assay.

One approach is to find an intermediate ground by expanding the qualitative transformation of quantitative data into multiple categories, as compared to only a binary classification scheme.

a) Lung Cancer Biomarkers

One embodiment is directed to a method for assessing the likelihood of lung cancer. A research effort to identify panels of biomarkers, which included a survey of known tumor protein biomarkers coupled with a discovery project for novel lung cancer specific biomarkers, was previously conducted (PCT Publ. No. WO 2009/006323, incorporated herein by reference). This work indicates that a combination of markers can be used to increase the sensitivity of testing for cancer without greatly affecting the specificity of the test. To accomplish this, markers were tested and analyzed in a way that is very different from the standard methods. This effort culminated in the establishment of a panel of six biomarkers that in the aggregate yield significant sensitivity and specificity for the early detection of lung cancer using the present methods. As disclosed herein, Applicants provide a new method and machine learning system that can be utilized to identify smokers at the highest levels of risk, based on a population or a cohort population, for follow-up testing by CT scanning.

In certain embodiments, the lung cancer biomarker panel comprises a series of three tumor marker proteins and three autoantibodies. Tumor markers, in such embodiments, are proteins released by the cancer itself into the patient's serum. Since the presence of these proteins or their increased expression is directly related to the cancer cells, these markers tend to be specific to cancer; however, they may often be found in more than one type of cancer. Furthermore, because these markers are derived directly from the tumor, their levels will depend (e.g., linearly, non-linearly, etc.) on the size of the tumor. This can make the markers less sensitive for the detection of early stage cancers. Autoantibodies are a function of the patient's immune response to the abnormal cancerous cells. Because the immune system amplifies its response even to a small amount of antigen, autoantibodies may be detected more easily in the early stage patient than proteins released by the cancer itself. Unfortunately, due to the heterogeneity of the cancers that are classified as lung cancer and the individual differences in patient immune responses, a large panel of autoantibodies is required to sensitively detect all lung cancers. The present panel combines both tumor markers and autoantibodies to achieve the greatest sensitivity for early stage lung cancer.

In certain embodiments, the tumor markers incorporated into the present methods for lung cancer comprise CEA, CA-125 and Cyfra 21-1. All three of these markers have been extensively studied by others and are currently in clinical use for monitoring of other cancers. While none of these markers has fared well as a stand-alone marker for the early detection of lung cancer, two important points must be reiterated: 1) these markers are not measured by the present method in the same way that they have been measured in the past for other indications, and 2) these markers are not deployed as stand-alone markers but rather are incorporated as part of an integrated panel of markers for re-stratification of patient risk. Specifically, results in the present methods for lung cancer are not based on an absolute serum level, but on an increase in level as compared to the median levels in matched control patients. As such, individual marker values as a total serum concentration are not measured; instead these three markers are incorporated in an aggregate biomarker composite score that has value only in re-categorizing patient risk for the presence of lung cancer. The tumor antigens used in the study reported by Molina, et al., Am J Respir Crit Care Med. published online 14 Oct. 2015 “Assessment of a Combined Panel of Six Serum Tumor Markers for Lung Cancer”, namely, CEA, CA15.3, SCC, CYFRA 21-1, NSE and ProGRP, are representative of those that may be used with the present invention.

In certain embodiments, three autoantibodies are utilized in the present lung cancer test, wherein the autoantibodies comprise anti-p53, anti-NY-ESO-1 and anti-MAPKAPK3. As noted above, most autoantibodies are only found in a limited number of patients. These three autoantibodies are among those most commonly found in lung cancer, and although each on its own has rather limited value, together they contribute to the overall sensitivity of the test. p53 is a well-known tumor suppressor protein that is often mutated in cancer. Such mutations may be enough to break natural immune tolerance to the protein and thus may be the source of anti-p53 antibodies. NY-ESO-1 has been characterized as a tumor specific marker, and thus autoantibodies against this protein may represent a way to measure the levels of a tumor marker in early stage disease via immune amplification. MAPKAPK3 is a kinase protein that can be activated by several oncogenic pathways and thus may be more commonly up-regulated in lung cancer, leading to the development of autoantibodies targeted against it.

In certain embodiments, the method for determining a quantified increased risk for the presence of a lung cancer in an asymptomatic human subject comprises: 1) measuring a panel of markers in a sample from a human subject (e.g., one that is at least 50 years of age and has a history of smoking tobacco); 2) determining a normalized score for each marker; 3) summing the normalized scores to obtain a composite score for the human subject; 4) quantifying the increased risk for the presence of the lung cancer for the human subject as a risk score, wherein the composite score is matched to a risk category of a grouping of stratified human subject populations, wherein each risk category comprises a multiplier indicating increased likelihood of having the lung cancer correlated to a range of composite scores; and 5) providing a risk score for the human subject, whereby the quantified increased risk for the presence of the lung cancer in an asymptomatic human subject has been determined.
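Steps 3) through 5) above may be sketched in Python as follows. The risk categorization table shown is entirely hypothetical (its score ranges and multipliers are placeholders, not values from the disclosure or from FIG. 10), and the function names are likewise illustrative. Each range is assumed to include its lower bound and exclude its upper bound.

```python
import math

# Hypothetical risk categorization table:
# (lower bound inclusive, upper bound exclusive, multiplier).
RISK_TABLE = [
    (0.0, 5.0, 0.5),
    (5.0, 10.0, 1.0),
    (10.0, 20.0, 2.0),
    (20.0, math.inf, 5.0),
]


def composite_score(normalized_scores):
    """Step 3: sum the normalized marker scores."""
    return sum(normalized_scores)


def risk_score(composite, table=RISK_TABLE):
    """Steps 4-5: return the multiplier of the range containing the composite."""
    for low, high, multiplier in table:
        if low <= composite < high:
            return multiplier
    raise ValueError("composite score falls outside the risk table")


# Hypothetical normalized scores for a three-marker panel.
example_risk = risk_score(composite_score([2.0, 3.0, 4.0]))
```

In this sketch the multiplier is read relative to the prevalence in the disease cohort, so a value of 1.0 would indicate no change from the cohort baseline.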

In certain embodiments, the method of determining a quantified increased risk for the presence of a disease such as cancer in an asymptomatic human subject may comprise: 1) measuring a concentration or an amount of each marker of a panel of markers in a sample from the human subject; 2) determining a normalized value of each marker in a sample from a human subject; 3) aggregating (e.g., summing, weighting, etc.) the normalized values using a machine learning system to obtain a biomarker composite score for the human subject; 4) determining a biomarker velocity for one or more biomarkers; 5) obtaining data pertaining to a patient's medical records; 6) obtaining publicly available information (e.g., environmental data, occupational data, genetic data, etc.) pertaining to an increased risk of cancer; 7) generating a master composite score for the human subject based on data from items 1-6; 8) quantifying the increased risk for the presence of cancer for the human subject as a risk score, by matching the master composite score to a risk category of a stratified cohort population or population, wherein each risk category comprises a numeric value indicating an increased likelihood of having the disease, e.g., cancer, correlated to a range of master composite scores, wherein the risk categories, cohort population, and weighting of risk factors are determined by a machine learning system; and 9) providing a risk score for the human subject, whereby the quantified increased risk for the presence of a cancer in an asymptomatic human subject relative to a population or a cohort population has been determined.

It is understood that the disease cohort (e.g., human subjects who are at least 50 years of age or older and have a history of smoking tobacco) is independently determined and in this instance is well understood to be the “at risk” group for developing lung cancer. The present method and machine learning system re-categorize those at-risk patients into risk categories by quantifying their true increased risk for the presence of lung cancer relative to their disease cohort.

In other embodiments, provided herein are methods of assessing the likelihood that a patient has lung cancer relative to a population or a cohort population comprising the steps of: obtaining a sample from the patient; measuring the levels of multiple biomarkers in the sample; calculating a biomarker composite score from the biomarker measurements; comparing the patient biomarker composite score to the biomarker composite scores of persons known to be at a high and a low risk for lung cancer; and determining the level of risk of the patient for having lung cancer relative to the population.

In this instance, an asymptomatic patient's cancer risk level relative to a population or a cohort population is determined. In certain embodiments, the determination may comprise quantifying the risk level relative to the population or cohort population. In other aspects, the multiple biomarkers comprise two or more, three or more, four or more, five or more, or six or more biomarkers. In one embodiment, the multiple biomarkers comprise six markers selected from CEA, CA125, Cyfra 21-1, Pro-GRP, anti-NY-ESO-1, anti-p53, anti-Cyclin E2 and anti-MAPKAPK3.

In other embodiments, obtaining a biomarker composite score may further comprise normalizing the measured biomarker values and aggregating the normalized values to form a biomarker composite score.

b) Pan-Cancer Biomarkers

In certain regions of the world, most notably in the Far East, many hospitals and “Health Check Centers” offer panels of tumor markers to patients as part of their annual physicals or check-ups. These panels are offered to patients without noticeable signs or symptoms of, or predisposition to, any particular cancer and are not specific to any one tumor type (i.e. “pan-cancer”). Exemplary of such testing approaches is the one reported by Y.-H. Wen et al., Clinica Chimica Acta 450 (2015) 273-276, “Cancer Screening Through a Multi-Analyte Serum Biomarker Panel During Health Check-Up Examinations: Results from a 12-year Experience.” The authors report on the results from over 40,000 patients tested at their hospital in Taiwan between 2001 and 2012. The patients were tested with the following biomarkers: AFP, CA 15-3, CA125, PSA, SCC, CEA, CA 19-9, and CYFRA 21-1 using kits available from Roche Diagnostics, Abbott Diagnostics, and Siemens Healthcare Diagnostics. The sensitivity of the panel for identifying the four most commonly diagnosed malignancies in that region (i.e. liver cancer, lung cancer, prostate cancer, and colorectal cancer) was 90.9%, 75.0%, 100% and 76%, respectively. Subjects with at least one of the markers showing values above the cut-off point were considered positive for the assay. No algorithm was reported. Moreover, neither clinical parameters nor biomarker velocity was factored into this test.

It is believed that the methods and machine learning systems according to the present invention can improve and enhance the pan-cancer biomarker panel reported by the Taiwanese group and readily permit its use in other parts of the world. For example, an algorithm that combines biomarker values with clinical parameters, and that improves automatically through the machine learning software, could be employed.

iii) Normalization of Data

In certain embodiments, the value obtained from measuring the marker in the sample is normalized. There is no intended limitation on the methodology used to normalize the values of the measured biomarkers provided that the same methodology is used for testing a human subject sample as was used to generate the Risk Categorization Table. In alternative embodiments, the concentrations of the measured biomarkers are used as input values for either training the machine learning algorithm or for classifying a patient into a category for the likelihood of having cancer.

Many methods for data normalization exist and are familiar to those skilled in the art. These include methods such as background subtraction, scaling, multiple of the median (MoM) analysis, linear transformation, least squares fitting, etc. The goal of normalization is to equate the varying measurement scales for the separate markers such that the resulting values may be combined according to a weighting scale as determined and designed by the user or by the machine learning system and are not influenced by the absolute or relative values of the marker found within nature.

US Publ. No. 2008/0133141 (herein incorporated by reference) teaches statistical methodology for handling and interpreting data from a multiplex assay. The amount of any one marker thus can be compared to a predetermined cutoff distinguishing positive from negative for that marker as determined from a control population study of patients with cancer and suitably matched normal controls to yield a biomarker composite score for each marker based on said comparison; and then combining the biomarker composite scores for each marker to obtain a biomarker composite score for the marker(s) in the sample. In some embodiments, biomarker velocity may also be included for one or more biomarkers.

The predetermined cutoffs can be based on ROC curves and the biomarker composite score for each marker can be calculated based on the specificity of the marker. Then, the biomarker composite score can be compared to a predetermined biomarker composite score to transform that biomarker composite score to a quantitative determination of the likelihood or risk of having lung cancer.

In certain embodiments, the quantitative determination of the likelihood or risk of having lung cancer is based upon the biomarker composite score, analysis of medical data pertaining to the patient, biomarker velocity data, as well as other public sources of information related to risk factors for cancer.

Another method for score transformation or normalization is, for example, applying the multiple of median (MoM) method of data integration. In the MoM method, the median value of each biomarker is used to normalize all measurements of that specific biomarker, for example, as provided in Kutteh et al. (Obstet. Gynecol. 84:811-815, 1994) and Palomaki et al. (Clin. Chem. Lab. Med. 39:1137-1145, 2001). Thus, any measured biomarker level is divided by the median value of the cancer group, resulting in a MoM value. The MoM values can be aggregated or combined (e.g., summed, weighted and added, etc.) for each biomarker in the panel resulting in a panel MoM value or aggregate MoM score for each sample.
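The MoM calculation described above can be sketched as follows. This is a minimal illustration only: the marker names are taken from the panel discussed herein, but the reference measurements, the patient values, and the function names mom_values and aggregate_mom are hypothetical, and plain summation is only one of the aggregation options mentioned.

```python
from statistics import median


def mom_values(measured, reference_medians):
    """Divide each measured level by the reference median for that biomarker."""
    return {name: level / reference_medians[name] for name, level in measured.items()}


def aggregate_mom(moms):
    """Sum the per-marker MoM values into a panel (aggregate) MoM score."""
    return sum(moms.values())


# Hypothetical reference-group measurements in arbitrary units (not real data).
reference_levels = {
    "CEA": [2.0, 4.0, 6.0],
    "CA125": [10.0, 20.0, 30.0],
}
reference_medians = {name: median(levels) for name, levels in reference_levels.items()}

# Hypothetical patient measurements in the same arbitrary units.
patient = {"CEA": 8.0, "CA125": 10.0}
patient_moms = mom_values(patient, reference_medians)  # CEA: 8/4, CA125: 10/20
panel_score = aggregate_mom(patient_moms)
print(patient_moms, panel_score)
```

Because each marker is divided by its own median, markers measured on very different scales contribute comparable values to the sum.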

In other embodiments, as additional samples are tested and the presence of cancer is validated, the sample size of the cancer population and the normals for determining the median can be increased to yield more accurate population data. In other embodiments, as additional samples are tested and the presence of cancer is validated, this data is fed back into the machine learning system to generate more accurate predictions of a patient's risk for having cancer.

In the next step of the present methods, the normalized values for the biomarkers are aggregated to generate a biomarker composite score for each subject. In certain embodiments, this method comprises summing the MoM scores for the markers to obtain the biomarker composite score.

In other words, the biomarker composite score is derived by measuring the levels of each of the markers used in a panel for a particular cancer in arbitrary units and comparing these levels to the median levels found in previous validation studies. In one embodiment, the cancer is lung cancer and the panel comprises the six markers disclosed above wherein this method generates six initial scores representing the multiple of the median (MoM) for each marker for a given patient. These initial scores are aggregated (e.g., summed, etc.) to yield the biomarker composite score.

In certain embodiments, the markers are measured and those resulting values normalized and then aggregated to obtain a biomarker composite score. In certain aspects, normalizing the measured biomarker values comprises determining the multiple of median (MoM) score. In other aspects, the present method further comprises weighting the normalized values before summing to obtain a biomarker composite score. In still other embodiments, a machine learning system may be utilized to determine weighting of the normalized values as well as how to aggregate the values (e.g., determine which markers are most predictive, and assign a greater weight to these markers), based on the embodiments presented herein.
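Weighting before summing can be illustrated with a short sketch. The weights below are hypothetical placeholders; in the embodiments described they would be determined by the user or learned by the machine learning system. The function name weighted_composite is likewise an assumption made for illustration.

```python
def weighted_composite(normalized, weights):
    """Weighted sum of normalized marker values.

    Markers without an explicit weight default to 1.0, so an empty weight
    mapping reproduces the plain (unweighted) sum.
    """
    return sum(weights.get(name, 1.0) * value for name, value in normalized.items())


# Hypothetical normalized (e.g., MoM) values and hypothetical weights.
normalized = {"CEA": 2.0, "CA125": 0.5}
weights = {"CEA": 1.5, "CA125": 1.0}  # a more predictive marker gets a larger weight

unweighted_score = weighted_composite(normalized, {})
weighted_score = weighted_composite(normalized, weights)
print(unweighted_score, weighted_score)
```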

D. Risk Categorization Table

Present embodiments further comprise quantifying the increased risk for the presence of the cancer for the human subject as a risk score, wherein the composite score is matched to a risk category of a grouping of stratified human subject populations wherein each risk category comprises a multiplier (or percentage) indicating an increased likelihood of having the cancer correlated to a range of biomarker composite scores. This quantification is based on the pre-determined grouping of a stratified cohort of human subjects. In one embodiment, the grouping of a stratified population of human subjects, or stratification of a disease cohort, is in the form of a risk categorization table. The selection of the disease cohort, i.e., the cohort of human subjects that share cancer risk factors, is well understood by those skilled in the art of cancer research. In certain embodiments, the cohort may share an age category and smoking history. However, it is understood that the cohort, and the resulting stratification, may be more multidimensional and take into account further environmental, occupational, genetic, or biological factors (e.g. epidemiological factors).

In certain embodiments, the grouping of a stratified human subject population used to determine a quantified increased risk for the presence of a cancer in an asymptomatic human subject comprises at least three risk categories, wherein each risk category comprises: 1) a multiplier (or percentage) indicating an increased likelihood of having the cancer, 2) a risk category and 3) a range of composite scores. In certain aspects, an individual risk score is generated by aggregating the normalized values determined from a panel of markers for the cancer to obtain a biomarker composite score that is correlated to a risk category of the risk categorization table. In a further aspect, the normalized values are determined as multiple of median (MoM) scores.

The risk identifier for a risk category is a label given to a specific group to provide context for the range of biomarker composite scores (and including other data, such as medical history) and the risk score, a multiplier (or percentage) indicating an increased likelihood of having the cancer in each group. In certain embodiments, the risk identifier is selected from low risk, intermediate-low risk, intermediate risk, intermediate-high risk and highest risk. These risk identifiers are not intended to be limiting, but may include other labels as dictated by the data used to generate the table and/or further refine the context of the data.

The risk score indicating an increased likelihood of having the cancer is a numerical value, such as 13.4; 5.0; 2.1; 0.7; and 0.4. This value is empirically derived and will change depending on the data, cohort of the subject population, type of cancer, medical records data, occupational and environmental factors, biomarkers, biomarker velocity, and so on. Thus, the multiplier indicating an increased likelihood of having the cancer may be a numerical value selected from 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, and 30, and so on, or some fraction thereof. The risk score may be represented as a numerical multiplier, e.g., 2×, 5×, etc., wherein the numerical multiplier indicates the increased likelihood over the normal prevalence of cancer in the cohort population that formed the basis for the stratification, for the human subject at the time of testing, or as a percentage, indicating a percent increase in risk relative to the normal prevalence of cancer. In other words, the human subject is from the same disease cohort as the one used to generate the risk categorization table. In the example of lung cancer, a disease cohort may be human subjects aged 50 years or older with a history of smoking tobacco. Thus, for example, if a patient receives a risk score of 13.4×, then that human subject has a 13.4 times increased risk for the presence of the cancer relative to the population.
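Matching a composite score to a risk category can be sketched as a simple range lookup. The risk identifiers and multipliers below are the example values quoted in the text; their pairing, and every score-range boundary, are assumptions made for illustration and do not reproduce any actual risk categorization table. The names RISK_TABLE and categorize are hypothetical.

```python
from bisect import bisect_right

# Each row: (lower bound of the composite-score range, risk identifier, multiplier).
# The lower bounds and the identifier/multiplier pairing are hypothetical.
RISK_TABLE = [
    (0.0, "low risk", 0.4),
    (5.0, "intermediate-low risk", 0.7),
    (10.0, "intermediate risk", 2.1),
    (15.0, "intermediate-high risk", 5.0),
    (20.0, "highest risk", 13.4),
]
LOWER_BOUNDS = [row[0] for row in RISK_TABLE]


def categorize(composite_score):
    """Return (risk identifier, multiplier) for the range containing the score."""
    if composite_score < LOWER_BOUNDS[0]:
        raise ValueError("composite score is below the lowest table range")
    # bisect_right counts the lower bounds that are <= the score,
    # so subtracting one gives the index of the matching row.
    index = bisect_right(LOWER_BOUNDS, composite_score) - 1
    _, identifier, multiplier = RISK_TABLE[index]
    return identifier, multiplier


print(categorize(22.0))
print(categorize(3.2))
```

Keeping the table as data rather than as branching logic means it can be regenerated whenever the cohort data or the stratification changes.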

As disclosed above, this multiplier value is empirically determined and in the present instance is determined from retrospective clinical samples. As such, the stratification of human subjects into cohort populations is based on analysis of retrospective clinical samples from subjects having a cancer wherein the actual incidence of cancer, or the positive predictive score, is determined for each stratified grouping. The specifics of these techniques are detailed throughout the application and in the example section.

In general, once a population of human subjects has been stratified, a positive predictive score can be determined, when retrospective samples with a known medical history are used, for each stratified grouping. This actual incidence of cancer in each of these groups is then divided by the reported incidence of cancer across the population of human subjects. For example, if the positive predictive score for one of the groupings from the stratified population of human subjects was 27%, this value would then be divided by the actual incidence of cancer across the cohort of the population that was stratified (e.g. 2%) to yield a multiplier of 13.5. In this scenario, the multiplier indicating increased likelihood of having the cancer is 13.5 and a subject tested that had a biomarker composite score matched to this category would have a risk factor of 13.5×. In other words, at the time of testing, that human subject would be 13.5 times more likely to have the presence of cancer than the general population in that particular cohort.
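The arithmetic in this example can be written out as a short sketch, using the 27% and 2% figures from the text. The function name risk_multiplier and the input checks are assumptions added for illustration.

```python
def risk_multiplier(group_incidence, cohort_prevalence):
    """Divide the incidence observed in a stratified group by the cohort prevalence.

    Both arguments are fractions between 0 and 1 (e.g., 0.27 for 27%).
    """
    if not 0.0 < cohort_prevalence <= 1.0:
        raise ValueError("cohort prevalence must be in (0, 1]")
    if not 0.0 <= group_incidence <= 1.0:
        raise ValueError("group incidence must be in [0, 1]")
    return group_incidence / cohort_prevalence


# Figures from the example in the text: 27% incidence in the group, 2% in the cohort.
multiplier = risk_multiplier(0.27, 0.02)
print(f"{multiplier:.1f}x")  # prints 13.5x
```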

By stratifying data based on these techniques, a data transformation into a more quantitative risk categorization is provided that offers improved guidance for selecting patients for follow-up tests in light of the costs of lung cancer confirmation, for example a CAT scan or a PET scan, as well as patient compliance. Hence, because lung cancer incidence in the at-risk population of heavy smokers is about 2%, that percentage was used as the cutoff point between a likelihood of having cancer and not, meaning that at that level the individual was equally likely to have cancer or not have cancer, that is, 1. Positive predictive values were determined using the disease prevalence of 2%, and each positive predictive value, expressed as a percentage, was then divided by two (the 2% prevalence) to yield another risk value, interpreted as the likelihood of having lung cancer as a multiple of the normal population risk, which can be considered as 1 or equally likely, or as a 2% risk based on population studies.

An example of a risk categorization table is provided in FIG. 10. The first column of the risk categorization table is a range of master composite scores. In the example provided herein, biomarker composite scores were generated from normalizing the data from the panel of measured biomarkers. A machine learning system may be utilized to aggregate the normalized biomarker scores along with other information (e.g., medical information, publicly available information, etc.) to generate a master composite score. These master composite scores may be grouped to provide a range and to drive stratification of the cohort population. The specifics of this methodology are detailed throughout the specification, including the Example section.

By transforming the biomarker composite score and other information (e.g., medical information, publicly available information, etc.) into a risk category that is based on cohort population data, the physician and patient then can assess whether follow-up is required, necessary or recommended based on whether the risk is just slightly above that of any smoker, i.e., 2%, or is higher because of a greater master composite score, which warrants greater consideration by the patient and physician.

By further data transformation of the PPV, the physician and patient will be the beneficiaries of a quantitative value indicating the prevalence of cancer amongst smokers which provides improved resolution of the risk of cancer in light of the biomarker assay. Hence, a patient with a master composite score of 20 or greater has a 13.4-fold greater likelihood of having lung cancer than any other heavy smoker (see FIG. 10). That 13.4× multiplier translates to an overall risk of about 27% of having lung cancer. That is, while all heavy smokers have a 1 in 50 chance of having lung cancer prior to testing, with a master composite score of 20 or more after testing, that individual has about a 1 in 4 chance of having lung cancer. Therefore, that person should consider follow-up testing to visualize whether any cancer (e.g., lung cancer) is present, and to make any behavioral changes to reduce the risk of cancer.
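The conversion from multiplier to overall risk in this example is a single multiplication, sketched below with the 13.4× multiplier and 2% prevalence given in the text. The function name post_test_risk and the cap at 100% are assumptions added for illustration.

```python
def post_test_risk(multiplier, cohort_prevalence):
    """Absolute risk after testing: multiplier times pre-test cohort prevalence.

    The result is capped at 1.0 so that a large multiplier cannot produce
    a probability above 100%.
    """
    return min(1.0, multiplier * cohort_prevalence)


# Figures from the text: 13.4x multiplier, 2% (1 in 50) pre-test prevalence.
risk = post_test_risk(13.4, 0.02)
print(f"post-test risk: {risk:.1%}")  # prints post-test risk: 26.8%
print(f"about 1 in {1 / risk:.1f}")   # prints about 1 in 3.7
```

The computed 26.8% corresponds to the "about 27%" and roughly 1 in 4 figures stated above.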

Thus, in certain embodiments, the method for determining a quantified increased risk for the presence of lung cancer in an asymptomatic human subject comprises: 1) measuring a level of CEA, CA125, Cyfra 21-1, anti-NY-ESO-1, anti-p53 and anti-MAPKAPK3 in a serum sample from the human subject, wherein the human subject is at least 50 years of age or older and has a history of smoking tobacco; 2) determining a normalized score for each marker; 3) summing or aggregating the normalized scores to obtain a biomarker composite score for the human subject; 4) quantifying the increased risk for the presence of the lung cancer for the human subject as a risk score, wherein the biomarker composite score is matched to one of at least three risk categories of a grouping of a stratified cohort human subject population wherein each risk category comprises a multiplier or other numeric value indicating an increased likelihood of having the lung cancer correlated to a range of biomarker composite scores; and 5) providing a risk score for the human subject, whereby the quantified increased risk for the presence of the lung cancer in an asymptomatic human subject has been determined.

In certain embodiments, the step of normalizing comprises determining the multiple of median (MoM) score for each marker. In this instance, the MoM scores are then summed or aggregated to obtain a biomarker composite score.

After quantifying the increased risk for presence of the cancer in the form of a risk score, this score may be provided in a form amenable to understanding by a physician. In certain embodiments the risk score is provided in a report. In certain aspects, the report may comprise one or more of the following: patient information, a risk categorization table, a risk score relative to a cohort population, one or more biomarker test scores, a biomarker composite score, a master composite score, identification of the risk category for the patient, an explanation of the risk categorization table and the resulting test score, a list of biomarkers tested, a description of the disease cohort, environmental and/or occupational factors, cohort size, biomarker velocity, genetic mutations, family history, margin of error, and so on.

E. Use of Methods to Aid in the Early Detection of Lung Cancer

The use in a clinical setting of the embodiments presented herein is now described in the context of lung cancer screening. It should be appreciated, however, that lung cancer is only one of many cancer types that can benefit from the embodiments of the present invention.

Primary care healthcare practitioners, who may include physicians specializing in internal medicine or family practice as well as physician assistants and nurse practitioners, are among the users of the techniques disclosed herein. These primary care providers typically see a large volume of patients each day and many of these patients are at risk for lung cancer due to smoking history, age, and other lifestyle factors. In 2012 about 18% of the U.S. population were current smokers and many more were former smokers with a lung cancer risk profile above that of a population that has never smoked.

The aforementioned NLST study (see background section) concluded that heavy smokers over a certain age who undergo yearly screening with CT scans have a substantial reduction in lung cancer mortality as compared to those who are not similarly screened. Nevertheless, for the reasons discussed above, very few at-risk patients are undergoing annual CT screening. For these patients the testing paradigm presented herein offers an alternative.

A blood sample from a patient with a heavy smoking history (e.g. having smoked at least a pack of cigarettes per day for 20 years or more) is sent to a laboratory qualified to test the sample using a panel of biomarkers with adequate sensitivity and specificity for early stage lung cancer. Non-limiting lists of such biomarkers are included throughout the specification, including the examples. In lieu of blood, other suitable bodily fluids such as sputum or saliva might also be utilized.

A master composite score for that patient is then generated using the techniques described herein. Using the master composite score, the patient's risk of having lung cancer, as compared to others having a comparable smoking history and age range, can then be calculated using, e.g., a software application or a risk categorization table such as the one shown in FIG. 10. If the risk calculation is to be made at the point of care, rather than at the laboratory, a software application compatible with mobile devices (e.g. a tablet or smart phone) may be employed.

Once the physician or healthcare practitioner has a risk score for the patient (i.e. the likelihood that the patient has lung cancer relative to a population of others with comparable epidemiological factors), follow-up testing, such as CT scanning, can be recommended for those at higher risk. It should be appreciated that the precise numerical cutoff above which further testing is recommended may vary depending on many factors including, without limitation, (i) the desires of the patients and their overall health and family history, (ii) practice guidelines established by medical boards or recommended by scientific organizations, (iii) the physician's own practice preferences, and (iv) the nature of the biomarker test including its overall accuracy and strength of validation data.

It is believed that use of the embodiments presented herein will have the twin benefits of ensuring that the most at-risk patients undergo CT scanning so as to detect early tumors that can be cured with surgery while reducing the expense and burden of false positives associated with stand-alone CT screening.

F. Kits

One or more biomarkers, one or more reagents for testing the biomarkers, cancer risk factor parameters, a risk categorization table and/or system or software application capable of communicating with a machine learning system for determining a risk score, and any combinations thereof are amenable to the formation of kits (such as panels) for use in performing the present methods.

In certain embodiments, the kit can comprise (a) reagents containing at least one antibody for quantifying one or more antigens in a test sample, wherein said antigens comprise one or more of: (i) cytokeratin 8, cytokeratin 19, cytokeratin 18, CEA, CA125, CA15-3, SCC, CA19-9, proGRP, Cyfra 21-1, serum amyloid A, alpha-1-anti-trypsin and apolipoprotein CIII; or (ii) CEA, CA125, Cyfra 21-1, NSE, SCC, ProGRP, AFP, CA-19-9, CA 15-3 and PSA; (b) reagents containing one or more antigens for quantifying at least one antibody in a test sample, wherein said antibodies comprise one or more of: anti-p53, anti-TMP21, anti-NPC1L1C-domain, anti-TMOD1, anti-CAMK1, anti-RGS1, anti-PACSIN1, anti-RCV1, anti-MAPKAPK3, anti-NY-ESO-1 and anti-Cyclin E2; and (c) a system, an apparatus, or one or more computer programs/software applications for performing the steps of normalizing the amount of each antigen and/or antibody measured in the test sample, summing or aggregating those normalized values to obtain a biomarker composite score, combining the biomarker composite score with other factors associated with an increased risk of cancer in a cohort population to generate a master composite score, and determining and assigning a risk score to each patient by correlating the master composite score to a risk categorization table using a software application and using the quantified increased risk for the presence of the cancer as an aid for further definitive cancer screening.

In the case of tumor antigens as biomarkers, these kits are preferably sourced from a supplier who has developed, optimized, and manufactured them to be compatible with one of the aforementioned automated immunoassay analyzers. Examples of such suppliers include Roche Diagnostics (Basel, Switzerland) and Abbott Diagnostics (Abbott Park, Ill.). The advantage of using kits so manufactured is that they are standardized to yield consistent results from laboratory to laboratory if the manufacturer's protocols for sample collection, storage, preparation, etc. are meticulously followed. That way, data generated from a medical institution or region of the world where cancer screening is commonplace can be used to build or improve the algorithms according to the present invention that can be used in medical institutions or regions where there is less history of this type of testing.

The reagents included in the kit for quantifying one or more regions of interest may include an adsorbent which binds and retains at least one region of interest contained in a panel, solid supports (such as beads) to be used in connection with said adsorbents, one or more detectable labels, etc. The adsorbent can be any of numerous adsorbents used in analytical chemistry and immunochemistry, including metal chelates, cationic groups, anionic groups, hydrophobic groups, antigens and antibodies.

In certain embodiments, the kit comprises the necessary reagents to quantify at least one of the following antigens: cytokeratin 19, cytokeratin 18, CA 19-9, CEA, CA-15-3, CA125, SCC, Cyfra 21-1, serum amyloid A, and ProGRP. In another embodiment, the kit comprises the necessary reagents to quantify at least one of the following antibodies: anti-p53, anti-TMP21, anti-NPC1L1C-domain, anti-TMOD1, anti-CAMK1, anti-RGS1, anti-PACSIN1, anti-RCV1, anti-MAPKAPK3, anti-NY-ESO-1 and anti-Cyclin E2.

In some embodiments, the kit further comprises computer readable media for performing some or all of the operations described herein. The kit may further comprise an apparatus or system comprising one or more processors operable to receive the concentration values from the measurement of markers in a sample and configured to execute computer readable media instructions to determine a biomarker composite score, combine the biomarker composite score with other risk factors to generate a master composite score and compare the master composite score to a stratified cohort population comprising multiple risk categories (e.g. a master risk categorization table) to provide a risk score.

G. Apparatus

Embodiments of the present invention further provide for an apparatus for assessing a subject's risk level for the presence of cancer and correlating the risk level with an increase or decrease of the presence of cancer after testing relative to a population or a cohort population. The apparatus may comprise a processor configured to execute computer readable media instructions (e.g., a computer program or software application, e.g., a machine learning system) to receive the concentration values from the evaluation of biomarkers in a sample and, in combination with other risk factors (e.g., medical history of the patient, publicly available sources of information pertaining to a risk of developing cancer, etc.), to determine a master composite score, compare it to a grouping of stratified cohort population comprising multiple risk categories (e.g. a risk categorization table), and provide a risk score. The methods and techniques for determining a master composite score and a risk score are described herein.

The apparatus can take any of a variety of forms, for example, a handheld device, a tablet, or any other type of computer or electronic device. The apparatus may also comprise a processor configured to execute instructions (e.g., a computer software product, an application for a handheld device, a handheld device configured to perform the method, a world-wide-web (WWW) page or other cloud or network accessible location, or any computing device). In other embodiments, the apparatus may include a handheld device, a tablet, or any other type of computer or electronic device for accessing a machine learning system provided as a software as a service (SaaS) deployment. Accordingly, the correlation may be displayed as a graphical representation, which, in some embodiments, is stored in a database or memory, such as a random access memory, read-only memory, disk, virtual memory, etc. Other suitable representations or exemplifications known in the art may also be used.

The apparatus may further comprise a storage means for storing the correlation, an input means, and a display means for displaying the status of the subject in terms of the particular medical condition. The storage means can be, for example, random access memory, read-only memory, a cache, a buffer, a disk, virtual memory, or a database. The input means can be, for example, a keypad, a keyboard, stored data, a touch screen, a voice-activated system, a downloadable program, downloadable data, a digital interface, a hand-held device, or an infrared signal device. The display means can be, for example, a computer monitor, a cathode ray tube (CRT), a digital screen, a light-emitting diode (LED), a liquid crystal display (LCD), an X-ray, a compressed digitized image, a video image, or a hand-held device. The apparatus can further comprise or communicate with a database, wherein the database stores the correlation of factors and is accessible to the user.

In another embodiment of the present invention, the apparatus is a computing device, for example, in the form of a computer or hand-held device that includes a processing unit, memory, and storage. The computing device can include, or have access to, a computing environment that comprises a variety of computer-readable media, such as volatile memory and non-volatile memory, removable storage and/or non-removable storage. Computer storage includes, for example, RAM, ROM, EPROM & EEPROM, flash memory or other memory technologies, CD ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other medium known in the art to be capable of storing computer-readable instructions. The computing device can also include or have access to a computing environment that comprises input, output, and/or a communication connection. The input can be one or several devices, such as a keyboard, mouse, touch screen, or stylus. The output can also be one or several devices, such as a video display, a printer, an audio output device, a touch stimulation output device, or a screen reading output device. If desired, the computing device can be configured to operate in a networked environment using a communication connection to connect to one or more remote computers. The communication connection can be, for example, a Local Area Network (LAN), a Wide Area Network (WAN) or other networks and can operate over the cloud, a wired network, wireless radio frequency network, and/or an infrared network.

H. Biomarker Velocity

Present invention embodiments may also utilize biomarker velocity to assess a risk of having cancer, e.g., lung cancer. As opposed to evaluating a single concentration of a biomarker, e.g., with regard to whether that biomarker is above a given threshold at a single point in time, biomarker velocities reflect biomarker concentrations as functions of time. By evaluating a series of biomarker levels over time (e.g., time t=0, t=3 months, t=6 months, t=1 year, etc.) for an individual patient, a velocity (or rate of increase) of the biomarker can be determined. Based on this type of methodology, a patient's risk of developing cancer can be stratified into high risk versus low risk (or any number of categories in between) based on the velocity.
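By way of non-limiting illustration, one way to estimate a biomarker velocity from serial measurements is a least-squares slope of level versus time. The following is a minimal sketch only; the function names, the example values, and the cutoff are hypothetical and are not taken from this disclosure.

```python
def biomarker_velocity(times_months, levels):
    """Least-squares slope of biomarker level versus time (level units per month)."""
    n = len(times_months)
    if n < 2 or n != len(levels):
        raise ValueError("need at least two paired measurements")
    mean_t = sum(times_months) / n
    mean_y = sum(levels) / n
    sxx = sum((t - mean_t) ** 2 for t in times_months)
    if sxx == 0:
        raise ValueError("measurements must span more than one time point")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_months, levels))
    return sxy / sxx


def stratify_by_velocity(velocity, high_cutoff):
    """Map a velocity to a coarse category; the cutoff is an assumed parameter."""
    return "high" if velocity >= high_cutoff else "low"


# Illustrative values only (not patient data): t = 0, 3, 6 and 12 months.
times = [0, 3, 6, 12]
levels = [2.0, 2.6, 3.2, 4.4]
velocity = biomarker_velocity(times, levels)
category = stratify_by_velocity(velocity, high_cutoff=0.1)
```

A slope is only one possible definition of velocity; a percentage change between consecutive tests would be an equally reasonable choice.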

Independent reports in the medical literature demonstrating that measuring change in tumor antigen levels over time in ovarian, pancreatic, and prostate cancer is superior to a single reading include Menon et al., J Clin Oncol, May 11, 2015; Lockshin et al., PLOS One, April 2014; and Mikropoulos et al., J Clin Oncol 33, 2015 (suppl 7; abstr 16). In at least one study, serial screening doubled the cancer detection rate as compared to single, one-time threshold-based screening.

Menon also disclosed an algorithm that identifies a spike in the levels of one or more biomarkers, as compared to that patient's previous test score, and automatically advises the patient and the provider to be tested more frequently (e.g., quarterly) or to take other actions.

I. Artificial Intelligence Systems for Predictive Analytics for Early Detection of Lung Cancer

Artificial intelligence systems include computer systems configured to perform tasks usually accomplished by humans, e.g., speech recognition, decision making, language translation, image processing and recognition, etc. In general, artificial intelligence systems have the capacity to learn, to maintain and access a large repository of information, to perform reasoning and analysis in order to make decisions, as well as the ability to self-correct.

Artificial intelligence systems may include knowledge representation systems and machine learning systems. Knowledge representation systems generally provide structure to capture and encode information used to support decision making. Machine learning systems are capable of analyzing data to identify new trends and patterns in the data. For example, machine learning systems may include neural networks, induction algorithms, genetic algorithms, etc. and may derive solutions by analyzing patterns in data. As generally understood in the art, linear statistical models such as logistic regression are not considered machine learning algorithms.

In some embodiments, one or more neural nets may be used to classify an individual patient into one of a plurality of categories, e.g., a category indicative of a likelihood of cancer or a category indicating that lung cancer is not likely. Inputs to the neural net may include a panel of biomarkers associated with the presence of cancer as well as clinical parameters (see, e.g., FIG. 13). In embodiments, clinical parameters include one or more of the following: (1) age; (2) gender; (3) smoking history in years; (4) number of packs per year; (5) symptoms; (6) family history of cancer; (7) concomitant illnesses; (8) number of nodules; (9) size of nodules; and (10) imaging data and so forth. In other embodiments, the clinical parameters include smoking history in years, number of packs per year, and age. In still other embodiments, the panel of biomarkers comprises any two, any three, any four, any five, any six, any seven, any eight, any nine, or any ten biomarkers. In preferred embodiments, the panel of biomarkers comprises two or more biomarkers selected from the group consisting of: AFP, CA125, CA 15-3, CA 19-9, CEA, CYFRA 21-1, HE-4, NSE, Pro-GRP, PSA, SCC, anti-Cyclin E2, anti-MAPKAPK3, anti-NY-ESO-1, and anti-p53. In other embodiments, the panel of biomarkers comprises CA 19-9, CEA, CYFRA 21-1, NSE, Pro-GRP, and SCC. In still other embodiments, the panel of biomarkers comprises AFP, CA125, CA 15-3, CA-19-9, CEA, HE-4, and PSA. In yet other embodiments, the panel of biomarkers comprises AFP, CA125, CA 15-3, CA-19-9, Calcitonin, CEA, PAP, and PSA. In other embodiments, the panel of biomarkers comprises AFP, BR 27.29, CA125II, CA 15-3, CA-19-9, Calcitonin, CEA, Her-2, and PSA.

A variety of machine learning models are available, including support vector machines, decision trees, random forests, neural networks or deep learning neural networks. Generally, support vector machines (SVMs) are supervised learning models that analyze data for classification and regression analysis. SVMs may plot a collection of data points in n-dimensional space (e.g., where n is the number of biomarkers and clinical parameters), and classification is performed by finding a hyperplane that can separate the collection of data points into classes. In some embodiments, hyperplanes are linear, while in other embodiments, hyperplanes are non-linear. SVMs are effective in high dimensional spaces, are effective in cases in which the number of dimensions is higher than the number of data points, and generally work well on data sets with clear margins of separation.

Decision trees are a type of supervised learning algorithm also used in classification problems. Decision trees may be used to identify the most significant variable that provides the best homogeneous sets of data. Decision trees split groups of data points into one or more subsets, and then may split each subset into one or more additional categories, and so forth until forming terminal nodes (e.g., nodes that do not split). Various algorithms may be used to decide where a split occurs, including a Gini Index (a type of binary split), Chi-Square, Information Gain, or Reduction in Variance. Decision trees have the capability to rapidly identify the most significant variables among a large number of variables, as well as identify relationships between two or more variables. Additionally, decision trees can handle both numerical and non-numerical data. This technique is generally considered to be a non-parametric approach, e.g., the data does not have to fit a normal distribution.
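For illustration only, the Gini criterion mentioned above can be sketched as follows for a single numerical variable. The helper names and the toy values are assumptions; a practical decision tree would apply the same search recursively across all variables.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a group: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted impurity of the two groups produced by a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one variable."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy values (not patient data): label 1 = cancer, 0 = no cancer.
split = best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

A weighted Gini of zero means both resulting groups are pure, which is the "best homogeneous sets" case described above.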

Random forest (or random decision forest) is a suitable approach for both classification and regression. In some embodiments, the random forest method constructs a collection of decision trees with controlled variance. Generally, for M input variables, a number of variables (nvar) less than M is used to split groups of data points. The best split is selected and the process is repeated until reaching a terminal node. Random forest is particularly suited to process a large number of input variables (e.g., thousands) to identify the most significant variables. Random forest is also effective for estimating missing data.

Neural nets (also referred to as artificial neural nets (ANNs)) are described throughout this application. A neural net, which is a non-deterministic machine learning technique, utilizes one or more layers of hidden nodes to compute outputs. Inputs are selected and weights are assigned to each input. Training data is used to train the neural networks, and the inputs and weights are adjusted until reaching specified metrics, e.g., a suitable specificity and sensitivity. An example process of training a neural net is provided in FIG. 14.

ANNs may be used to classify data in cases in which correlation between dependent and independent variables is not linear or in which classification cannot be easily performed using an equation. More than 25 different types of ANNs exist, with each ANN yielding different results based on different training algorithms, activation/transfer functions, number of hidden layers, etc. In some embodiments, more than 15 types of transfer functions are available for use with the neural network. Prediction of the likelihood of having cancer is based upon one or more of the type of ANN, the activation/transfer function, the number of hidden layers, the number of neurons/nodes, and other customizable parameters.

Deep learning neural networks, another machine learning technique, are similar to regular neural nets, but are more complex (e.g., typically have multiple hidden layers) and are capable of performing operations (e.g., feature extraction) automatically, generally requiring less interaction with a user than a traditional neural net.

According to present invention embodiments, machine learning methods are able to classify individuals having a likelihood of cancer with at least 70%, 75%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sensitivity when the specificity is set at 80%. This result is significantly better than linear statistical models such as threshold classification with a single variable or multivariate logistic regression with multiple variables. In some embodiments, at least a 5% improvement, at least a 10% improvement, at least a 15% improvement, at least a 20% improvement, at least a 25% improvement, or at least a 30% improvement is achieved using artificial neural nets as compared to traditional statistical methods such as traditional logistic regression or multivariate linear regression. See FIGS. 15A-D and Example 4.

In other embodiments, the present machine learning methods are able to classify individuals having a likelihood of cancer with at least 70%, 75%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sensitivity when the specificity is set at 85%. In certain embodiments, the present machine learning methods are able to classify individuals having a likelihood of cancer with at least 70%, 75%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sensitivity when the specificity is set at 90%.
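To make the phrase "sensitivity when the specificity is set at" a given value concrete, the following sketch selects the lowest score threshold that achieves a target specificity and reports the sensitivity at that threshold. It is illustrative only: the function name and the scores are hypothetical, and a score at or above the threshold is assumed to be a positive call.

```python
def sensitivity_at_specificity(scores, labels, target_specificity):
    """Return (threshold, sensitivity, specificity) for the lowest qualifying threshold.

    labels: 1 = diagnosed with cancer, 0 = not diagnosed with cancer.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    # Candidates in ascending order; the final one classifies every record as negative.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    for threshold in candidates:
        specificity = sum(1 for s in negatives if s < threshold) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(1 for s in positives if s >= threshold) / len(positives)
            return threshold, sensitivity, specificity


# Hypothetical classifier outputs for ten records (not study data).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
threshold, sensitivity, specificity = sensitivity_at_specificity(scores, labels, 0.80)
```

Because specificity can only rise and sensitivity can only fall as the threshold increases, the first qualifying threshold gives the highest sensitivity available at the target specificity.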

In some embodiments, the neural net comprises one hidden layer, two hidden layers, three hidden layers, four hidden layers, or five hidden layers. Neural nets may contain any number of nodes, e.g., between 1 and 1000 nodes, between 1 and 500 nodes, between 1 and 400 nodes, between 1 and 300 nodes, between 1 and 200 nodes, between 1 and 100 nodes, between 1 and 50 nodes, between 5 and 50 nodes, between 5 and 45 nodes, between 10 and 40 nodes, between 15 and 35 nodes, between 20 and 30 nodes, or any combination thereof. In some embodiments, nodes may be evenly distributed, with each hidden layer receiving the same or about the same number of nodes. In other embodiments, nodes may be unevenly distributed, e.g., with the first hidden layer receiving fewer nodes than the second hidden layer or with the first hidden layer receiving more nodes than the second hidden layer.

In illustrative embodiments, the neural net comprises two hidden layers. See FIG. 13. The first hidden layer may comprise 2 to 20 nodes and the second layer may comprise 15 to 35 nodes. The first hidden layer may comprise 2 to 10 nodes and the second layer may comprise 15 to 25 nodes. In illustrative embodiments, the first hidden layer has 5 nodes and the second hidden layer has 20 nodes. See FIG. 15D and Example 4. In other embodiments, the first hidden layer may comprise 15 to 35 nodes and the second layer may comprise 2 to 20 nodes. In other embodiments, the first hidden layer may comprise 15 to 25 nodes and the second layer may comprise 2 to 10 nodes. In other embodiments, the neural net has a total of 20 to 30 nodes.
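As a non-limiting sketch of the 5-node and 20-node arrangement, the following builds a feed-forward net with two hidden layers and computes one output. The weights are random and untrained, and the tanh and sigmoid activation functions, the input count of 11, and all function names are assumptions made for illustration rather than details of FIG. 13.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases, activation):
    """Apply one fully connected layer to the input vector x."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def build_network(n_inputs, hidden_sizes=(5, 20), seed=0):
    """Layers for n_inputs -> 5 -> 20 -> 1 by default."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_sizes, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(network, x):
    """Forward pass: tanh hidden layers, sigmoid output between 0 and 1."""
    for weights, biases in network[:-1]:
        x = dense(x, weights, biases, math.tanh)
    weights, biases = network[-1]
    return dense(x, weights, biases, sigmoid)[0]


# Eleven inputs, e.g., six biomarker values plus five clinical parameters (assumed).
network = build_network(n_inputs=11)
output = predict(network, [0.0] * 11)
```

Training (adjusting the weights against labeled records) is deliberately omitted; the sketch shows only how the layer sizes determine the shape of the computation.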

Neural networks can detect complex nonlinear relationships between variables, determine which variables are the most predictive among a set of variables, and discover relationships between variables that were not previously known. For example, one of skill in the art may determine which groups of biomarkers in combination with specific clinical features are the most predictive of a likelihood of having lung cancer. For example, an ANN may be used to determine that a subset of 6 biomarkers and a subset of 5 clinical features are highly predictive, e.g., 90% or greater sensitivity at 80% specificity, to identify individuals with an increased likelihood of having cancer.

In illustrative embodiments, the following biomarkers, CEA, NSE, CYFRA 21-1, CA19-9, Pro-GRP and SCC, are evaluated using a neural net with the following clinical features: smoking status, pack years, patient age, family history of lung cancer, and symptoms.

In some embodiments, neural nets may be used to determine which inputs of a plurality of inputs are the most important for accurately identifying patients that are likely to have lung cancer. For instance, starting with a large number of inputs, the neural net can identify which subset, e.g., which 5 to 15 inputs of a larger group of inputs, are the most predictive. This approach can help reduce costs in screening as well as simplify computation, as not every biomarker or clinical factor linked to lung cancer needs to be tested, but rather, only the most predictive inputs. See Example 6 and Table B, for a ranking of biomarkers and clinical factors for lung cancer.

Thus, present invention embodiments encompass neural net approaches to determining which subset of biomarkers combined with which subset of clinical factors and optionally other factors are the most predictive. In various embodiments, the neural net may be used to determine a total of three, four, five, six, seven, eight, nine, ten, eleven, or twelve factors (e.g., at least two biomarkers and at least one clinical factor) that are highly predictive of the likelihood of lung cancer. In general, highly predictive indicates that the neural net is able to identify patients likely to have lung cancer with at least 75% sensitivity, at least 85% sensitivity, or at least 90% sensitivity or greater (at 80% specificity). Thus, the neural net may be used to optimize the subset of inputs from the total number of possible inputs, in order to determine which subset (which of the biomarkers, clinical factors, or any other inputs to the neural network as disclosed herein) is the most predictive of the likelihood of having lung cancer.

In some embodiments, the neural net can be used to identify novel predictors of a disease. For example, a novel biomarker or clinical factor or other type of input as disclosed in this application (e.g., from the literature, from the environment, etc.) may be selected as an input into a neural net, and it can be determined whether this input is predictive of lung cancer. In some cases, the input may have no known previous association with lung cancer. See Example 7.

In some embodiments, the neural net can be used to rank inputs for a disease, to identify which inputs are the most predictive of a disease among a larger population/group of inputs.

In some embodiments, inputs may be selected in order to improve the performance of the neural net. For example, rather than picking the set of inputs that achieves the highest possible sensitivity with a clinically relevant specificity such as 80% or greater, the inputs are selected to reach a sensitivity threshold (e.g., 80% or greater), and once this threshold is reached, the inputs are selected to optimize the performance of the neural net.

Accordingly, systems, methods and computer readable media are presented herein regarding using a machine learning system, e.g., a neural net, to identify a patient's risk of having cancer. A set of data is stored in a memory accessible by the neural net or machine learning system. The set of data comprises a plurality of patient records, each patient record including a plurality of parameters and corresponding values for a patient, and also includes a diagnostic indicator indicating whether or not the patient has been diagnosed with cancer. The plurality of parameters includes various biomarkers, clinical factors and other factors which may be selected as inputs into the neural net system. The diagnostic indicator is an affirmative indicator that the patient has cancer, e.g., a lung X-ray and/or biopsy confirming a diagnosis of cancer. A subset of the plurality of parameters is selected for inputs into the machine learning system, wherein the subset includes a panel of at least two different biomarkers and at least one clinical parameter.

In order to train the machine learning system, the set of data (e.g., retrospective) is randomly partitioned into training data and validation data. A classifier is generated using the machine learning system based on the training data, the subset of inputs and other parameters associated with the machine learning system as described herein. It is determined whether the classifier meets a predetermined Receiver Operator Characteristic (ROC) statistic, specifying a sensitivity and a specificity, for correct classification of patients. In embodiments, the specificity is at least 80% and the sensitivity is at least 75%.

When the classifier does not meet the predetermined ROC statistic, the classifier may be iteratively regenerated based on the training data and a different subset of inputs until the classifier meets the predetermined ROC statistic. When the machine learning system meets the predetermined ROC statistic, a static configuration of the classifier may be generated. This static configuration may be deployed to a physician's office for use in identifying patients at risk of having lung cancer or stored on a remote server that can be accessed by the physician's office.

Once the neural net has been trained on the training data, the neural net may be validated using the validation data. The validation data also includes a plurality of parameters and corresponding values for a patient, and includes a diagnostic indicator indicating whether or not the patient has been diagnosed with cancer. The validation data may be classified using the classifier, and it may be determined whether the classifier meets the predetermined ROC statistic based on this data. When the classifier does not meet the predetermined ROC statistic, the classifier may be iteratively regenerated based on the training data and a different subset of the plurality of parameters, until the regenerated classifier meets the predetermined ROC statistic. The validation process may then be repeated.
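The partition, generate, check and regenerate cycle described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: `train_mean_threshold` is a toy stand-in for a real machine learning system, the partition is stratified by diagnosis so that both classes appear in each part, and all names and records are hypothetical.

```python
import itertools
import random


def partition(records, validation_fraction=0.3, seed=0):
    """Random partition into (training, validation), stratified by diagnosis."""
    rng = random.Random(seed)
    training, validation = [], []
    for diagnosed in (True, False):
        group = [r for r in records if r["cancer"] == diagnosed]
        rng.shuffle(group)
        n_validation = round(len(group) * validation_fraction)
        validation += group[:n_validation]
        training += group[n_validation:]
    return training, validation


def sens_spec(classify, records):
    """Sensitivity and specificity of classify(record) -> bool on labeled records."""
    tp = sum(1 for r in records if r["cancer"] and classify(r))
    tn = sum(1 for r in records if not r["cancer"] and not classify(r))
    n_pos = sum(1 for r in records if r["cancer"])
    n_neg = len(records) - n_pos
    return (tp / n_pos if n_pos else 0.0), (tn / n_neg if n_neg else 0.0)


def train_mean_threshold(training, subset):
    """Toy learner: threshold the sum of the selected inputs midway between class means."""
    def score(r):
        return sum(r[p] for p in subset)
    pos = [score(r) for r in training if r["cancer"]]
    neg = [score(r) for r in training if not r["cancer"]]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda r: score(r) >= threshold


def search_input_subsets(records, parameters, train, subset_size=3,
                         min_sensitivity=0.75, min_specificity=0.80):
    """Regenerate the classifier with different input subsets until both checks pass."""
    training, validation = partition(records)
    for subset in itertools.combinations(parameters, subset_size):
        classifier = train(training, subset)
        checks = [sens_spec(classifier, training), sens_spec(classifier, validation)]
        if all(se >= min_sensitivity and sp >= min_specificity for se, sp in checks):
            return subset, classifier
    return None  # no subset met the predetermined statistic


# Synthetic records (not patient data); "noise" carries no diagnostic information.
records = []
for i in range(40):
    cancer = i % 2 == 1
    level = (5.0 if cancer else 1.0) + (i % 3) * 0.1
    records.append({"noise": (i * 7) % 11, "marker_a": level, "marker_b": level,
                    "pack_years": level, "cancer": cancer})
result = search_input_subsets(records, ["noise", "marker_a", "marker_b", "pack_years"],
                              train_mean_threshold)
```

The returned classifier corresponds loosely to the static configuration described above: once a subset passes both checks, the search stops and the classifier is no longer modified.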

A user, with access to a computing device with the static classifier, may enter values corresponding to a patient into the computing device. The patient may then be classified, using the static classifier, into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer. The system may then send a notification to the user (e.g., a physician) recommending additional diagnostic testing (e.g., a CT scan, a chest x-ray or biopsy) when the patient is classified into the category indicative of a likelihood of having cancer.

In some embodiments, the machine learning system, e.g., the neural net, may be continuously trained over time. Test results obtained from the diagnostic testing, which confirm or deny the presence of cancer, may be incorporated into the training data set for further training of the machine learning system, and to generate an improved classifier by the machine learning system.

In general, a classifier may include but is not limited to a support vector machine, a decision tree, a random forest, a neural network, or a deep learning neural network.

Thus, in some embodiments, the values of a panel of biomarkers in a sample from a patient are measured. A classifier is generated by a machine learning system to classify the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer, wherein the classifier comprises a sensitivity of at least 70% and a specificity of at least 80%, and wherein the classifier is generated using the panel of biomarkers comprising at least two different biomarkers, and at least one clinical parameter. When a patient is classified into a category indicating a likelihood of having cancer, a notification to a user for diagnostic testing is provided. In embodiments, the category indicative of a likelihood of having cancer may be further categorized into qualitative groups (e.g., high, low, medium, etc.) for the likelihood of having cancer, or into quantitative groups (e.g., a percentage, multiplier, risk score, composite score) of the likelihood of having cancer.
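As a simple illustration of the qualitative and quantitative groupings, a classifier output between 0 and 1 could be mapped to a category and to a multiplier relative to a cohort. The cutoffs, the cohort prevalence, and the function names below are hypothetical values chosen only for the sketch.

```python
def risk_category(probability, cutoffs=(0.2, 0.5)):
    """Qualitative group for a probability in [0, 1]; cutoffs are assumed values."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    low_max, medium_max = cutoffs
    if probability < low_max:
        return "low"
    if probability < medium_max:
        return "medium"
    return "high"


def risk_multiplier(probability, cohort_prevalence):
    """Quantitative group: patient likelihood relative to the cohort prevalence."""
    if cohort_prevalence <= 0.0:
        raise ValueError("cohort prevalence must be positive")
    return probability / cohort_prevalence


category = risk_category(0.62)
multiplier = risk_multiplier(0.06, cohort_prevalence=0.02)
```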

In other embodiments, a computer implemented method is provided for predicting a likelihood of cancer in a subject, using a computer system having one or more processors coupled to a memory storing one or more computer readable instructions for execution by the one or more processors, the one or more computer readable instructions comprising instructions for: storing a set of data comprising a plurality of patient records, each patient record including a plurality of parameters for a patient, and wherein the set of data also includes a diagnostic indicator indicating whether or not the patient has been diagnosed with cancer; selecting a plurality of parameters for inputs into a machine learning system, wherein the parameters include a panel of at least two different biomarker values and at least one type of clinical data; and generating a classifier using the machine learning system, wherein the classifier comprises a sensitivity of at least 70% and a specificity of at least 80%, and wherein the classifier is based on a subset of the inputs.

Given the myriad of factors associated with the development of cancer, present invention embodiments utilize artificial intelligence/machine learning systems, e.g., neural networks, for providing an improved, more accurate determination of an individual's likelihood (risk) of having cancer. By providing the neural network system with a myriad of risk factors associated with the presence of cancer, some of which have a greater impact than others, as well as a sufficiently large training data set, the neural network may more accurately predict an individual's likelihood (risk) of having cancer, offering patients and clinicians a strong, evidence-based individualized risk assessment, with specific follow-up recommendations for patients identified as high-risk. Machine learning systems offer the ability to determine which of the myriad of risk factors are most important, as well as how to weight such factors. In addition, machine learning systems can evolve over time, as more data becomes available, to make even more accurate predictions.

In some embodiments, although the machine learning system can evolve over time to make more accurate predictions, the machine learning system may have the capability to deploy improved predictions on a scheduled basis. In other words, the techniques used by the machine learning system to determine risk may remain static for a period of time, allowing consistency with regard to determination of a risk score. At a specified time, the machine learning system may deploy updated techniques that incorporate analysis of new data to produce an improved risk score. Thus, the machine learning systems described herein may operate: (1) in a static manner; (2) in a semi-static manner, in which the classifier is updated according to a prescribed schedule (e.g., at a specific time); or (3) in a continuous manner, being updated as new data is available.
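The three operating modes can be sketched as a small wrapper that decides when a retrained classifier replaces the deployed one. This is a minimal sketch; the class name, the use of integer day counts for the schedule, and the string placeholders standing in for classifiers are all assumptions.

```python
from enum import Enum


class UpdateMode(Enum):
    STATIC = "static"
    SEMI_STATIC = "semi-static"
    CONTINUOUS = "continuous"


class DeployedClassifier:
    """Holds the classifier in use plus any retrained candidate awaiting release."""

    def __init__(self, classifier, mode, release_day=None):
        self.classifier = classifier
        self.mode = mode
        self.release_day = release_day  # only meaningful in semi-static mode
        self.pending = None

    def offer(self, retrained, today):
        """Offer a retrained classifier; return the classifier now in use."""
        if self.mode is UpdateMode.CONTINUOUS:
            self.classifier = retrained
        elif self.mode is UpdateMode.SEMI_STATIC:
            self.pending = retrained
            self.release_if_due(today)
        # In static mode the retrained classifier is ignored.
        return self.classifier

    def release_if_due(self, today):
        """In semi-static mode, swap in the pending classifier once the schedule allows."""
        due = self.release_day is not None and today >= self.release_day
        if self.mode is UpdateMode.SEMI_STATIC and self.pending is not None and due:
            self.classifier, self.pending = self.pending, None
        return self.classifier


semi_static = DeployedClassifier("v1", UpdateMode.SEMI_STATIC, release_day=90)
before_release = semi_static.offer("v2", today=30)
after_release = semi_static.release_if_due(today=90)
```

Holding the candidate until the release day keeps risk scores consistent between releases, which is the stated purpose of the semi-static mode.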

While example embodiments presented herein refer to neural networks, present invention embodiments are not intended to be limited to neural networks and may apply to any type of machine learning system. Thus, it is expressly understood that the embodiments presented herein are not intended to be limited strictly to neural networks, but may include any form of artificial intelligence system of any type or of any combination having the functionality described herein.

FIGS. 1A-1B are schematic diagrams of an example computing environment in accordance with present invention embodiments. An example artificial intelligence computing system, also referred to as Neural Analysis of Cancer System (NACS) 100, for determining a risk of having cancer is shown. In summary, data from a patient's medical records and other publicly available data is provided to a master neural net, wherein the master neural net analyzes the data to predict a patient's individual risk of having cancer, relative to a cohort population.

In some embodiments, a plurality of other neural nets are utilized to provide data to the master neural net in a form conducive for analysis. However, it is expressly understood that while NACS 100 may comprise a plurality of other neural nets (e.g., for data cleaning, for data extraction, etc.) for providing the data in a suitable form, present invention embodiments also include providing data to the master neural net in a pre-defined form suitable for analysis without additional processing by other neural nets. Thus, present invention embodiments include the master neural net, as well as the master neural net in combination with any one or more other neural nets for data handling.

FIG. 1A comprises one or more neural nets NN 1-7, one or more databases db 10-60, public bus 65 and scaled bus 70, HIPAA Redaction and Anonymizer 75 as well as one or more knowledge stores (KS) 80, 110 and 120. In general, each database 10-60 includes one or more types of information associated with a risk of having cancer. In some embodiments, this information may be distributed across a plurality of databases, while in other embodiments, the information may be included in a single database. Each database may be local to or remote from each of the other databases, and each neural net may be local to or remote from each of the databases. Each component of FIG. 1A is described in additional detail as follows.

Primary EMR db 10 may be an electronic medical record (EMR) database, e.g., at a hospital, physician's office, etc., comprising one or more medical records for one or more patients. Importantly, EMR db 10 will supply the biomarker levels or values of at least the patient's most recent blood test. In other embodiments, EMR db 10 may also provide the historical biomarker data from the patient, if serial testing was conducted and the information is available, to permit biomarker velocity to be factored into the algorithm. In some embodiments, this database is a primary source of medical information (e.g., a patient's primary care physician, hospital, specialist, or any other source of primary care, etc.) for a particular patient. Secondary EMR db 20 may be an EMR database (e.g., at another hospital, at another physician's office) comprising medical records for a family member related to the patient or comprising additional medical records for the patient not found in primary EMR db 10. In some aspects, secondary EMR database 20 may comprise more than one database. In general, EMR databases may comprise patient medical records, including one or more of the following types of information (e.g., age, gender, address, medical history, physician notes, symptoms, prescribed medications, known allergies, imaging data and corresponding annotations, treatment and treatment outcomes, blood work, genetic testing, expression profiles, family histories, etc.).

In some embodiments, a first neural net (also referred to as NN1 “Adder”) may be used for determining whether additional family member information or patient information is available in secondary EMR db 20. In the event that additional information is available, secondary EMR db 20 may be queried for this information.

A second neural net (also referred to as NN2a “Cleaner” or NN2b “Cleaner”) is used to identify missing, ambiguous or incorrect medical data (collectively referred to as “problematic data”) pertaining to the patient. For example, neural net NN2a may be used to identify problematic data from primary EMR database db 10, and neural net NN2b may be used to identify problematic data from secondary EMR database db 20. In some embodiments, problematic data is remedied by obtaining the information as part of an outreach process through which other sources of information are utilized to remedy the problematic data. For example, a medical provider, the patient, or a family member may be contacted via telephone, electronic mail or any other suitable means of communication to resolve issues with problematic data. Alternatively, other EMR databases, other sources of electronic information, etc., may be accessed to remedy the problematic data.

In some embodiments, the identified problematic data may be ranked according to potential impact to the determination of the risk score, such that the identified problematic data having a larger impact on the risk score is ranked as more important, in order to effectively allocate resources. For example, a missing zip code may have less of a potential impact on the risk score than errors in smoking history or lab tests, which would have a larger potential impact, and may therefore be tolerated.
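One hypothetical way to rank problematic fields is to assign each field an assumed impact weight and sort by it. The weights, the tolerance level, and the field names below are invented for illustration and are not values from this disclosure.

```python
# Assumed impact of each field on the risk score (0 = negligible, 1 = critical).
IMPACT_WEIGHTS = {"smoking_history": 0.9, "lab_tests": 0.8, "zip_code": 0.1}


def rank_problematic_fields(fields, weights=IMPACT_WEIGHTS, default=0.5, tolerate_below=0.2):
    """Return (fields to remedy, most important first; fields that may be tolerated)."""
    ranked = sorted(fields, key=lambda f: weights.get(f, default), reverse=True)
    to_remedy = [f for f in ranked if weights.get(f, default) >= tolerate_below]
    tolerated = [f for f in ranked if weights.get(f, default) < tolerate_below]
    return to_remedy, tolerated


to_remedy, tolerated = rank_problematic_fields(["zip_code", "lab_tests", "smoking_history"])
```

In practice the weights might be derived from the trained classifier itself (for example, from how strongly each input influences the output) rather than fixed by hand.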

Clean data is sent to HIPAA Redaction and Anonymizer module 75, which anonymizes data to comply with regulatory and other legal requirements. Unless otherwise authorized by the individual, individual health care records are usually anonymized in order to comply with privacy and other regulations. In some embodiments, the individual records are anonymized by replacing patient specific identification information (e.g., a name, social security number, etc.) with a unique identifier, providing a way to identify the individual after the risk score has been determined.
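Replacing identifying fields with a unique identifier can be sketched as follows. This is illustrative only and is not a statement of what regulatory compliance requires; the field names and the use of a random UUID as the identifier are assumptions.

```python
import uuid


def anonymize(record, identifying_fields=("name", "ssn")):
    """Return (redacted record, link table entry) for one patient record.

    The link entry maps the unique identifier back to the removed fields so the
    individual can be re-identified after the risk score has been determined.
    It would need to be stored separately under appropriate access controls.
    """
    token = uuid.uuid4().hex
    redacted = {k: v for k, v in record.items() if k not in identifying_fields}
    redacted["patient_token"] = token
    link = {token: {k: record[k] for k in identifying_fields if k in record}}
    return redacted, link


# Placeholder values, not real patient data.
record = {"name": "Jane Doe", "ssn": "000-00-0000", "age": 61, "cea": 4.2}
redacted, link = anonymize(record)
```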

Once the data has been cleaned, and has been anonymized by HIPAA Redaction and Anonymizer 75, it may be stored in clean data knowledge store (KS) 80, a repository generated by NACS 100. In some embodiments, once the problematic data has been remedied, the corrected data may be stored in the primary EMR db 10 or the secondary EMR db 20 itself, and therefore, a separate knowledge base repository may not be needed.

A third neural net (also referred to as neural net NN3 “EMR Extractor”) may be used for extracting specific relevant information from clean data KS 80, which includes clean data from a patient's medical records. Neural net NN3 is trained to identify electronic medical records data that are relevant for determining a risk score. For example, by providing a sufficiently large number of training data sets in which known medical data of specified types are presented to the neural net, and by progressing through an iterative process in which potential medical data identified by the neural net is marked as correct or incorrect with regard to the known type, the neural net can be trained to learn to identify specific medical data (e.g., images, unstructured, structured, etc.). Neural net NN3 may classify the data into different data types, e.g., raw images, numeric/structured data, BM velocity, unstructured data, etc., and the data may be stored in an extracted data knowledge store (KS) 130 (see FIG. 1B).

NN3 may separate the identified patient data into different categories of information, e.g., raw images, unstructured data (e.g., physician notes, diagnosis, treatments, radiological notes, etc.), numerical data (e.g., blood test results, biomarkers), demographic data (age, weight, etc.) and biomarker velocity. Some types of data are subject to further processing, e.g., by another neural net, while others are sent to NN12 (referred to as the “master” NN) for processing.

In other embodiments, a fourth neural net (also referred to as NN4 “Puller”) may be used for identifying relevant or requested data in databases db 30-60 that pertains to the patient's medical history. Examples of publicly available databases include environmental databases 30, employment databases 40, population databases 50, and genetic databases 60. In general, this neural net may be used to identify publicly available data (e.g., data stored in databases, data in journal articles, publications, etc.) having information regarding risk factors for having cancer, and pertinent to a patient's medical history.

Examples of the types of information that may be extracted from the EMR dbs 10 and 20, and provided to neural net NN4 for further analysis, are given here. For the environmental database db 30, the following fields may be identified: patient location, work zip code, and years at the address. For the occupational/employment database db 40, the number of years in a particular employment may be identified. For the population database db 50, patient demographics such as gender, age, number of years as a smoker, and family history may be identified. For the genetic database db 60, mutations such as BRAF V600E mutation, EGFP Pos may be identified. This information may be provided to neural net NN4, and corresponding questions may be generated to determine relevant risk factors.

For example, NACS 100 may identify an occupation of an individual, and generate a question to be asked to database db 40 regarding whether that individual's occupation has a known association with cancer. A patient may have lived in a particular zip code for a determined number (e.g., 10) of years. Accordingly, a corresponding question of “What is the cancer risk for a patient living in that particular zip code for the past 10 years?” could be generated and stored in public knowledge store (KS) 110, to be asked at a subsequent point in time. As another example, NACS 100 may generate a question to be asked to environment db 30 regarding whether an individual's occupation is associated with an increased risk of cancer. A patient may have spent a number of years (e.g., 20) employed in a certain profession (e.g., coal miner). Accordingly, the corresponding question of “What is the cancer risk for working as a coal miner for 20 years?” could be generated and stored in public KS 110, to be asked at a subsequent point in time. Similarly, NACS 100 may also generate genetic questions, e.g., whether a mutation or other genetic abnormality from a patient's medical history has been implicated in the occurrence of cancer. In general, various types of environmental, employment, population and genetic based questions may be generated and stored in public KS 110 as questions to be asked, e.g., with the assistance of a question-answer generation module of a type known in the art.
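A template-based sketch of this kind of question generation is given below. It is illustrative only: the dictionary keys, the target labels such as `environment_db_30`, and the routing of each question type to a particular database are assumptions, and a real question-answer generation module would likely be considerably more elaborate.

```python
def generate_questions(patient):
    """Build (target_db, question) pairs from whichever fields are present.

    All key names and target labels are hypothetical.
    """
    questions = []
    if "zip_code" in patient and "years_at_address" in patient:
        questions.append((
            "environment_db_30",
            f"What is the cancer risk for a patient living in zip code "
            f"{patient['zip_code']} for the past "
            f"{patient['years_at_address']} years?",
        ))
    if "occupation" in patient and "years_employed" in patient:
        questions.append((
            "employment_db_40",
            f"What is the cancer risk for working as a "
            f"{patient['occupation']} for {patient['years_employed']} years?",
        ))
    for mutation in patient.get("mutations", []):
        questions.append((
            "genetic_db_60",
            f"Has the mutation {mutation} been implicated in the "
            f"occurrence of cancer?",
        ))
    return questions
```

The returned pairs could then be stored in a structure playing the role of public KS 110 and asked at a later time.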

Public bus 65, also shown in FIG. 1A, provides a communication network with which to provide questions related to a patient's medical history to publicly available databases, wherein the answers to the questions may be incorporated into the determination of the risk score. For example, information may be transmitted between public knowledge store (KS) 110, which may comprise questions generated by NACS 100 that are to be asked to the databases, and the databases db 30-60 themselves.

As previously indicated, publicly available databases db 30-60 may comprise various types of information associated with a risk of having cancer. Accordingly, present invention embodiments may utilize one or more of these databases, in addition to the information from electronic medical records db 10 and 20 and other information, to determine a likelihood for the presence of cancer for an individual.

For example, environment database db 30 may comprise environmental or geographical factors associated with the presence of cancer. For example, certain geographical zip codes may indicate environmental factors, e.g., presence of a carcinogen within a given area, radioactive elements, toxins, chemical spills or contamination, etc., associated with an increased risk of having cancer. Database db 30 may also comprise information regarding environmental factors associated with the development of a disease such as cancer, e.g., smog levels, pollution levels, exposure to secondhand smoke, etc.

Employment database db 40 may comprise information linking some types of employment to an increased risk of having cancer. For example, certain industries and job types, e.g., coal miner, construction workers, painters, industrial manufacturers, etc., may have an increased likelihood of exposure to radiation or cancer-causing chemicals, including asbestos, lead, etc., which increases the risk for having cancer.

Population database db 50 comprises information, usually anonymized, for a population of individuals having a diagnosis of cancer. In some embodiments, database db 50 may include profiles for individual patients, each patient profile including various types of information, e.g., age, gender, smoking history in years and number of packs per day, imaging data, employment, residence, biomarker scores, biomarker composite scores, or biomarker velocities, etc., that may influence an individual's risk of having cancer. By collecting and analyzing this type of data, cohort populations may be determined by a neural net.

Genetic db 60 may include genes identified as being associated with an increased risk of having cancer. For example, genetic db 60 may include any publicly available database or repository, as well as journal articles, research studies, or any other source of information that links a particular genetic sequence, mutation, or expression level to an increased risk of having cancer.

Any of databases 30-60 may comprise a plurality of databases. For example, environment db 30 may comprise a plurality of databases, each database including a different type of environmental information, employment db 40 may comprise a plurality of databases, each database including a different type of employment information, population db 50 may comprise a plurality of databases, each database comprising population information, and genetic db 60 may comprise a plurality of databases, each database comprising a different type of genetic information.

Information may be transmitted between databases db 30-60 and stored in scaled knowledge store (KS) 120 via scaled bus 70. For example, scaled KS 120 may comprise answers to the questions generated by NACS 100 that were asked to databases dbs 30-60. Both public KS 110 and scaled KS 120 are repositories that are created by NACS.

To facilitate asking questions to dbs 30-60, a fifth set of neural nets (also referred to as NN5 a, NN5 b, NN5 c, or NN5 d) are used for identifying specific data in a specific subject matter knowledge source or database (e.g., dbs 30-60). For example, neural net NN5 a may be utilized to identify specific environmental data in environment db 30, neural net NN5 b may be utilized to identify specific employment data in employment db 40, neural net NN5 c may be utilized to identify specific population data in population db 50, and neural net NN5 d may be utilized to identify specific genetic data in genetic db 60. Knowledge sources or databases considered to be leading sources of information in a specific field may be selected for inclusion with dbs 30-60. Examples of knowledge sources include journal articles, databases, presentations, gene sequence or gene expression repositories, etc. In some aspects, each category of information or each source of information itself may have a corresponding neural net for identifying relevant data, and in some embodiments, the neural net may be trained to recognize information in a vendor-specific manner. Each database also may comprise both structured and unstructured data.

In some embodiments, if a new study reports a new genetic link to cancer, or a new geographical “hotspot” for the occurrence of cancer, the NACS system 100 could search information in databases 30-60 to reevaluate its determined risk and provide an updated risk to a patient or physician. For example, a question could be generated and stored in public KS 110, which would be asked to dbs 30-60 at predefined intervals (e.g., monthly, quarterly, annually, etc.), and the risk determination could be updated periodically.

In the medical domain, new clinical literature and guidelines are continuously being published, describing new screening procedures, therapies, and treatment complications. As new information becomes available, queries may be automatically run by a question-answer generation module without active involvement (in an automated manner). The results may be proactively sent to the physician or patient or stored in scaled KS 120 for subsequent use.

In some embodiments, NACS 100 can automatically generate queries from the semantic concepts, relations, and data extracted from dbs 10 and 20, using, e.g., a question-answer module. Using semantic concepts and relations, queries for the question-answering system can be automatically formulated. Alternatively, it is also possible for a physician or patient to enter queries in natural language or other ways, through a suitable user interface.

In still other embodiments, a sixth set of neural nets (also referred to as NN6 a, NN6 b, NN6 c, or NN6 d) is used to scale each database output, or answer to a question from dbs 30-60, to a range of, e.g., 0 to 9 for weighting. For example, the output zip code of 14304 for the Love Canal, N.Y. might be scaled as ‘9’ to indicate high risk, whereas the output zip code of 86336 for Sedona, Ariz. may be a ‘0’ to indicate low risk. Many different types of scaling are covered by embodiments of the invention. In some embodiments, database outputs are scaled according to a common reference, regardless of the database, while in other embodiments, database outputs are scaled on a relative basis, e.g., such that a weighting of ‘9’ for a given database may not have the same impact as a weighting of ‘9’ for another database. Depending upon the disparity of the data, each database may have its own corresponding neural net to scale relevant information.
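For illustration, a simple linear (min-max) mapping onto the 0 to 9 range is sketched below. The specification assigns this scaling to trained neural nets NN6 a-NN6 d, so the arithmetic here is a stand-in, and the function name `scale_0_9` and its `lo` and `hi` reference bounds are assumptions.

```python
def scale_0_9(value, lo, hi):
    """Linearly map a raw risk value from [lo, hi] onto integers 0..9.

    Values outside [lo, hi] are clamped to the nearest bound.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    fraction = (clamped - lo) / (hi - lo)
    return round(fraction * 9)
```

Scaling on a relative basis, as opposed to a common reference, would correspond to choosing `lo` and `hi` separately for each database.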

In some embodiments, each answer is generated along with confidences and sources of information. The confidence of each answer can, for example, be a number between 0 and 1, 0 and 10, or any desired range.

In still other embodiments, a seventh neural net (also referred to as NN7 “Gene Snip”) is used to identify similar and/or related genes with reference to the genes associated with the patient's medical history. Similar or related genes may be identified on the basis of literature, public databases of genetic information, etc. The neural net NN7 may also output the types of genes that are relevant for further analysis, in addition to the risks associated with the identified gene.

According to the example computing environment shown in FIG. 1A, extracted data from neural net NN3 is sent to other neural nets for analysis via extracted data bus 138. Output data from the external databases db 30-60, which may be stored in scaled KS 120, is loaded onto scaled bus 70 and provided to another neural net for analysis as scaled demographic data 170. Data from neural net NN7 is provided to another neural net for analysis as genetic data 165, and population data 160 is provided as input to other neural nets. Each of these outputs is shown with reference to FIG. 1B.

As shown in FIG. 1B, data from extracted data bus 138 may be classified into different types of data. Data may be classified as raw images 155 (e.g., X-rays, CT scans, MRI, ultrasounds, EEG, EKG, etc.), and the raw images may be provided to NN10 for further analysis as described herein. Data may also be classified into biomarker (BM) velocity data 145, and this data may be provided to neural net NN9 for further analysis as described herein. Data may be further classified into numeric data 150, e.g., age, ICD, blood/biomarker tests, smoking history (years and packs per day), diagnosis (Dx), gender, etc., or unstructured data 140. Unstructured data 140 may include text or numeric based information, e.g., physician notes, annotations, etc. NN8 may analyze unstructured data 140 as described herein using Natural Language Processing and other well-established techniques.

An eighth neural net (also referred to as neural net NN8 Natural Language Processing (“NLP”)) is utilized to analyze unstructured data 140, e.g., physician notes and other EMR text (e.g., radiology, history of present illness (HPI)). After processing by neural net NN8, the data may be separated into multiple categories including a text-based category, including lab reports, progress notes, impressions, patient histories, etc., as well as derived data, which includes data derived from the text-based data, e.g., years of smoking and frequency of smoking (e.g., how many packs a day).

In other embodiments, a ninth neural net (also referred to as NN9) is utilized to analyze biomarker (BM) velocity. This neural net, which may be trained in a supervised or unsupervised manner, analyzes the velocity of biomarkers of a biomarker panel and determines whether the velocity is indicative of the presence of cancer. Markers may include CYFRA, CEA, ProGrp, etc., and the neural net may analyze both the absolute value and relative value as a function of time. In some aspects, having a velocity above a threshold value may be indicative of the presence of cancer. Individual as well as group velocity scores may be generated for a combination of biomarkers, e.g., a panel. In some embodiments, this neural net may be untrained, and may identify previously unknown associations.
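One way to make the notion of biomarker velocity concrete is a least-squares slope of concentration against time, compared with a threshold. The sketch below is an assumption-laden illustration: the specification does not say how NN9 computes velocity, and the function names, the units, and the per-marker thresholds are hypothetical.

```python
def velocity(times, values):
    """Least-squares slope of values against times (units per unit time)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired measurements")
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        raise ValueError("measurement times must not all be equal")
    numer = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    return numer / denom


def flag_panel(panel, thresholds):
    """Return {marker: True if its velocity exceeds its threshold}.

    panel maps a marker name to (times, values); thresholds maps a
    marker name to a hypothetical velocity cutoff.
    """
    return {name: velocity(t, v) > thresholds[name]
            for name, (t, v) in panel.items()}
```

For a marker measured as 2.0, 3.0, and 4.0 at months 0, 6, and 12, the velocity is one sixth of a unit per month.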

In other embodiments, a tenth neural net (also referred to as NN10 “Sieve”) is utilized to analyze raw images, e.g., X-rays, CT scans, MRIs, etc., and extract clinical imaging data. In some embodiments, this neural net NN10 may extract portions of images relevant to determining an increased risk of cancer.

In other embodiments, an eleventh neural net (also referred to as neural net NN11 “Untrained Cohort Analysis”) is utilized to identify patterns in cohort groupings. A particular cohort grouping may change as a function of time based upon the decisions made by the neural net NN11. For example, age correlates with risk of developing cancer, but the optimal grouping (e.g., ages 42-47, 53-60, etc.) is not known. The neural net NN11 may initially determine that a cohort population of ages 53-60 with a smoking history of ten years carries an increased risk of 50%. The optimal grouping (cohort) may change as additional data becomes available, and as other risk factors are integrated into the analysis. By utilizing an untrained neural net, such as neural net NN11, to discover naturally occurring grouping patterns (e.g., a cluster of individuals developing cancer at a given age and based on a similar smoking history), the grouping patterns may be identified and analyzed to determine an optimal cohort for a given patient. In some embodiments, NN11 is untrained and will be self-taught. By analyzing the data using an untrained NN, the NN may utilize clustering to find relevant groupings. The algorithm may iteratively try different groupings and different risk factors until finding an optimal cohort for the given patient. In many cases, an untrained NN will find associations that would not be discovered by traditional techniques.
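The idea of letting groupings emerge from the data can be illustrated with a one-dimensional k-means clustering of ages. This is a sketch under assumptions: the specification mentions clustering only in general terms, the choice of k-means with a deterministic initialization is made here for illustration, and the function `kmeans_1d` is hypothetical.

```python
def kmeans_1d(values, k, iters=100):
    """Cluster numbers into k groups with plain k-means.

    Centers start at evenly spaced positions in the sorted data so the
    result is deterministic. Returns (centers, clusters).
    """
    pts = sorted(values)
    if k < 1 or len(pts) < k:
        raise ValueError("need at least k values")
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # Recompute each center as the mean of its cluster.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters
```

Applied to the ages 42, 43, 45, 47, 53, 55, 58, and 60 with k = 2, the sketch separates a younger group from an older one without being told where the boundary lies. Additional risk factors such as smoking history would require a multi-dimensional distance.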

A twelfth neural net (also referred to as neural net NN12 “Master NN”) receives a plurality of inputs, each associated with occurrence of a disease, e.g., cancer. In this example, NN12 receives inputs from the patient EMR data bus 142, some of which are further processed using neural nets NN8-10, as well as scaled demographic data 170, genetic data 165, and population data 160 after being processed by NN11 to generate cohort data.

Input data to neural net NN12 may be normalized according to the techniques presented herein. Neural net NN12 assigns weights to each input, and performs an analysis to make a prediction (a % likelihood) of having cancer based on these risk factors. Initially, the assigned weights may be determined from training the neural net using a data set that includes patients with a cancer diagnosis, their medical history, and other associated risk factors. As additional data becomes available about risk factors for cancer (e.g., new risk factors, etc.), this data may be integrated into neural net NN12 and the corresponding weighting may evolve as a function of time. The output data of neural net NN12 may be stored in db 10 and/or db 20 as part of a feedback loop.

NN12 is trained to produce outputs, shown at block 180, including patient risk scores (e.g., an individual patient's % risk in a given cohort, margin of error, size of cohort, labels of cohort, etc.), major risk factors identified (which may be different from the cohort population), recommended diagnosis (DX) and treatment success factors. Neural net NN12 may also generate other types of data as described herein.

Neural net NN12 may utilize feedback to write output back to databases db 10 and db 20 for continuous improvement of the machine learning system, allowing the machine learning system to make more accurate predictions by continually incorporating new data into the training set. As new patient data becomes available, e.g., confirming or denying that the patient has cancer, NACS system 100 may utilize this information for additional intrinsic training, allowing the determined % risk score to improve in accuracy. For example, if the patient is diagnosed with cancer, then types of treatments, outcomes (longevity) and success rates may be compiled and fed back into the system, allowing the system to be trained on successful treatments and best (positive) clinical indicators with the best sensitivity, selectivity, and lowest ambiguity. If the patient is not diagnosed with cancer, then this information is fed back into the system to train for best negative clinical indicators. The physician's diagnosis can be compared with the NACS risk score as well.

Present invention embodiments may include at least one EMR, e.g., db 10, a master neural net NN12 for performing a risk determination, and any one or more of the aforementioned public databases db 30-60, as well as any one or more of the aforementioned knowledge stores 80, 110, 120, 130, and 135, and any one or more of the neural nets NN1-11.

In some embodiments, the neural net may be trained to identify information provided in a vendor-specific format.

In other embodiments, neural net NN12 may determine that insufficient information is present to make a determination regarding a patient's risk score.

FIG. 2A shows an example of a neural net. As previously indicated, neural net systems generally refer to artificial neural network systems, comprising a plurality of artificial neurons or nodes, such that the system architectures and concepts behind the design of neural net systems are based on biological systems and/or models of neurons.

For example, components of a neural network may include an input layer comprising a plurality of input processing elements or nodes 210, one or more “hidden” layers 220 comprising processing elements or nodes, and an output layer 230, connected to the hidden layer, comprising a plurality of output processing elements or nodes. Each node may be connected to one or more of the other nodes as part of the hidden computational layer. The hidden layer 220 may comprise a single layer or multiple layers, with each layer comprising a plurality of interconnected computational nodes, wherein the nodes of one layer are connected to another layer.

Neural nets may also comprise weighting and aggregation operations as part of the hidden layer. For example, each input may be assigned a respective weight, e.g., a number in a range of 0 to 1, 0 to 10, etc. The weighted inputs may be provided to the hidden layer, and aggregated (e.g., by summing the weighted input signals). In some embodiments, a limiting function is applied to the aggregated signals. Aggregated signals (which may be limited) from the hidden layer may be received by the output layer, and may undergo a second aggregation operation to produce one or more output signals. An output limiting function may also be applied to the aggregated output signals, resulting in a predicted quantity by the neural net. Many different configurations are possible, and these examples are intended to be non-limiting.
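The weighting, aggregation, and limiting steps just described can be sketched as a forward pass through one hidden layer. The sigmoid is used here as the limiting function purely as an assumption, since the specification does not name one, and the function names and weight layout are hypothetical.

```python
import math


def sigmoid(x):
    """An example limiting function that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> hidden layer -> single output.

    hidden_weights holds one list of input weights per hidden node;
    output_weights holds one weight per hidden node.
    """
    # First aggregation: weighted sum of inputs at each hidden node, then limit.
    hidden = [sigmoid(sum(w * x for w, x in zip(node, inputs)))
              for node in hidden_weights]
    # Second aggregation at the output node, followed by the output limit.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))
```

With the sigmoid as the output limiting function, the predicted quantity always falls strictly between 0 and 1, which is convenient when the output is read as a likelihood.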

Neural net systems may be configured for a specific application, e.g., pattern recognition or data classification, through a learning process referred to as training, as described herein. Thus, neural networks can be trained to extract patterns, detect trends, and perform classifications on complex or imprecise data, often too complex for humans, and in many cases too complex for other computer techniques to analyze.

Information within a neural net, as shown in FIG. 2B, may also flow bidirectionally. For example, data flowing from the input layer to the output layer is shown as forward activity and the error signal flowing from the output layer to the input layer is represented as feedback or “backpropagation”. The error signal may feed back into the system, and as a result, the neural net may adjust the weights of one or more inputs.

Training Neural Nets

Many different techniques for the operation of neural networks are known in the art. Neural nets typically undergo an iterative learning or training process, in which examples are presented to the neural net one at a time, before the neural net is placed in production mode to operate on (non-training) data. In some cases, the same training dataset may be presented to the neural net multiple times, until the neural net converges on a correct solution, reaching specified criteria, e.g., a given confidence interval, a given error, etc. Typically, a set of validation data (e.g., the dataset) is sufficiently large to allow convergence of the neural network, allowing the neural network to predict, within a specified margin of error, the correct classification (e.g., increased risk of cancer or no increased risk of cancer) of non-training data. See Example 3.

Training may occur in a supervised or unsupervised manner. In a supervised learning process, a neural net may be provided with a large training data set in which the answers are unambiguously known. For example, the neural net may be presented with test cases from the dataset in a serial manner, along with the answer for the dataset. By providing the neural net with a large dataset comprising both positive and negative answers (e.g., relevant data and non-relevant data) and telling the neural net which data corresponds to positive answers and which to negative answers, the neural net may learn to recognize positive answers (e.g., relevant data) provided that a sufficiently large dataset is provided. In a supervised learning process, an individual or administrator may interact with the machine learning system to provide information regarding whether the result determined by the machine learning system is accurate.

In an unsupervised learning process, a neural net may also be provided with a large training data set. However, in this case, the answers as to which data are positive and which data are negative are not provided to the neural net and may not be known. Rather, the neural net may use statistical means, e.g., K-means clustering, etc., to determine positive data. By providing the neural net with a large dataset comprising both positive and negative answers (e.g., relevant data and non-relevant data), the neural net may learn to recognize patterns in data.

Each input to a neural net is typically weighted. In some embodiments, the initial weighting (e.g., random weighting, etc.) is determined by the machine learning system, while in other cases, the initial weighting may be user-defined. The machine learning system processes the input information with the initial weighting to determine an output. The output may then be compared to the training data set, e.g., experimentally obtained and validated data. The machine learning system may determine an error signal between the computationally obtained prediction and the training data set, and feed or propagate this signal back through the system into the input layer, resulting in adjustment of the input weighting. In other embodiments, the error signal may be used to adjust weights in the hidden layer in order to improve the accuracy of the neural net. Accordingly, during the training process, the neural net may adjust the weighting of the inputs and/or hidden layer during each iteration through the training data set. As the same set of training data may be processed multiple times, the neural net may refine the weights of the inputs until reaching convergence. Typically, the final weights are determined by the machine learning system.
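The cycle of predicting, computing an error signal, and adjusting weights until convergence can be sketched for the simplest case of a single sigmoid unit. This is a deliberate simplification, not the training procedure of the specification: full backpropagation would also pass the error through hidden layers, and the learning rate, tolerance, zero initial weights, and function names below are assumptions.

```python
import math


def predict(weights, bias, x):
    """Weighted sum of the inputs passed through a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, lr=0.5, max_epochs=5000, tol=0.05):
    """Repeatedly present (inputs, target) pairs and adjust the weights.

    Stops when the mean absolute error over one pass drops below tol,
    or after max_epochs passes through the training data.
    """
    weights = [0.0] * len(samples[0][0])  # user-defined initial weighting
    bias = 0.0
    for _ in range(max_epochs):
        total_error = 0.0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # error signal
            total_error += abs(error)
            # Feed the error back: nudge each weight in proportion to its input.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
        if total_error / len(samples) < tol:
            break
    return weights, bias
```

Presenting the same small dataset many times, as in this loop, mirrors the repeated passes through the training data described above.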

As an example of a training process for neural net NN1, neural net NN1 may be trained to look for indications that secondary EMR db 20 has relevant data. For example, neural net NN1 may be presented with a dataset from EMR system db 20 having the same name and social security number as the patient, along with a confirmation that the patient from the secondary EMR matches the primary EMR. Similarly, the Adder neural net NN1 may be presented with a data set from another EMR system having the same name and a different social security number as the patient, along with a confirmation that the data from the secondary EMR does not match the patient from the primary EMR. Based on this type of training, the neural net can learn to distinguish which records from which databases match specific patients.

As another example, and with reference to neural nets NN2 a and NN2 b, these neural nets may be trained to recognize missing data. For example, these neural nets may be presented with a complete dataset for a patient with an indication that the data set is complete. These neural nets may then be presented with another dataset with specified missing data. After a sufficiently large training session, the neural net will learn the concept of missing data, and will be able to identify missing data in a non-training dataset (production mode). Similarly, neural nets NN2 a and NN2 b may be trained on what constitutes problematic data. For example, if a zip code does not closely match a populated location field, the zip code is likely wrong, since a patient is more likely to identify their city and state correctly.

As yet another example, each neural net NN5 a-NN5 d is trained, a priori, to find specific data (e.g., from environmental dbs, employment dbs, population dbs, genetic dbs, etc.). Upon meeting specified criteria (e.g., correctly predicting, within a specified error rate, which individuals among a population of individuals have cancer), the neural net may be placed in production mode.

Accordingly, for the purposes of the embodiments provided herein, it will be generally assumed that the various neural nets are trained with a data set of sufficient size to reach convergence.

After the neural net is trained, the neural net may be exposed to new data, and its performance may be tested, e.g., with another dataset in which the prediction from the neural net may be validated with clinical data. Once the neural net has been established to behave within established guidelines, the neural net may be exposed to true unknown data.

As neural nets are highly adaptive, the specific criteria used to make decisions to determine a risk score may evolve as a function of time and as new data becomes available. While it may be possible to characterize the neural net as a function of a particular moment in time, the neural net and its corresponding decision making process evolves as a function of time. Accordingly, data flow within the nodes of the network may evolve over time as new data is obtained, and as new conclusions are validated.

FIG. 3 is a flow diagram showing example operations for cleaning information in accordance with an embodiment of the invention. This approach may be utilized to identify patient information in EMR db 10 and EMR db 20, as well as correct problematic information, and store the corrected information in a knowledge store, e.g., clean data KS 80 (see FIG. 1A). At operation 300, information for a patient that is stored in one or more medical records of a primary Electronic Medical Records (EMR) system is identified. At operation 310, it is determined (e.g., using Adder neural net NN1) whether additional data (e.g., additional medical information from the patient or from family members related to the patient) stored in one or more secondary EMRs is needed to compute a risk score. If the machine learning system can compute the risk score without additional data, the process may continue to operation 320. If additional information is needed, at operation 315, the additional data is obtained. At operation 320, the machine learning system identifies (e.g., using neural net NN2 a and NN2 b) one or more fields of patient data from EMR db 10 and EMR db 20 that are problematic (e.g., missing data, wrong data, ambiguous data, etc.) and are to be corrected. In some embodiments, the problematic data to be corrected is ranked based upon the potential impact of each identified field on the determined risk score. In some embodiments, the highest ranked (highest potential impact) fields are corrected, and the system may determine that the calculation may be performed without correcting fields that have a lower potential impact. At operation 330, the one or more identified fields are corrected through one or more outreach processes (e.g., manually, automatically, or both). An outreach process may include contacting another source of information, such as a physician, a patient, another computing system, etc., in order to correct the problematic data. At operation 340, the machine learning system determines whether the information needs to be anonymized, and if so, the information is anonymized. Otherwise, the process may continue to operation 350. At operation 350, the anonymized (or corrected) information is stored in clean data knowledge store (KS) 80, where it is ready for extraction, e.g., by NN3 “EMR Extractor”.

FIG. 4 shows a flow diagram showing example operations involving master neural net NN12, according to embodiments of the invention. In this example, a plurality of inputs are provided to the master neural net NN12. These inputs include data from the EMR Pt Data Bus 142, as well as from dbs 30-60. The master neural net NN12 analyzes the received inputs to determine an individual's risk for having cancer in a population, e.g., a cohort population.

In this example, data from extracted data KS 130 may be provided to master neural net NN12, either directly or through one or more other neural nets. In particular, at operation 400, numeric data may be provided to NN12 for analysis. In some embodiments, this data may be provided directly to NN12, wherein each type of data may be weighted as a separate input. Other types of data that undergo processing by other neural nets may also be provided to neural net NN12. Biomarker (BM) velocity data that has been processed by neural net NN9 at operation 405 may be provided to neural net NN12 at operation 410 for analysis. NN9 may determine, based on a velocity of biomarker concentration (e.g., a rate of increase of one or more biomarkers as a function of time), that a patient is at increased risk for having cancer. At operation 415, unstructured data is provided to NN8 for analysis. At operations 420 and 425, numeric data derived from unstructured data as well as the unstructured data itself (both outputs of neural net NN8) may be provided to neural net NN12 for processing. At operation 430, raw image data is provided to NN10 for analysis. At operation 435, the output of neural net NN10, analyzed image data, may be provided to neural net NN12 for analysis.

In addition to the data from bus 138, master neural net NN12 may also receive inputs from the publicly available databases, as shown in operations 440-460. At operation 440, scaled risk factors from databases dbs 30-60, which may be stored in scaled KS 120, are provided as inputs to master neural net NN12. At operation 445, genetic markers are provided to NN7 for analysis and the output is provided to NN12 for analysis at operation 450. At operation 455, population data in the form of a cohort from neural net NN11 may be generated and provided to neural net NN12 for analysis at operation 460.

The above examples are not intended to be limiting with regard to the types of inputs that may be provided to NN12. Present invention embodiments may include any input derived from a patient's medical information or any source of publicly available information related to a patient's medical condition.

Once the inputs are received, master neural net NN12 may be utilized to analyze the information in order to determine whether an individual has an increased risk for having cancer, as shown at operation 465.

In some embodiments, master neural net NN12 may receive a cohort population from neural net NN11. Upon analyzing the different types of data, master NN12 may modify the cohort population to include additional factors. For instance, if a cohort population was originally provided by neural net NN11 as male, 50 years of age, and 10-15 pack years, upon consideration of other risk factors, neural net NN12 may modify the cohort to include additional information, e.g., male, 50 years of age, 10-15 pack years, a composite biomarker score greater than a threshold value (or a category indicative of a likelihood of having, or not having, cancer), and a specified biomarker having a certain velocity. Thus, the cohort population may evolve as a function of time.
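The cohort refinement in the foregoing example may be sketched, purely for illustration, as two successive filters over a patient population. The `Patient` fields, the threshold values, and the sample population are hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    sex: str
    age: int
    pack_years: float
    composite_score: float  # composite biomarker score (hypothetical scale)
    bm_velocity: float      # velocity of a specified biomarker (hypothetical units)


def initial_cohort(p):
    """Cohort as originally provided: male, 50 years of age, 10-15 pack years."""
    return p.sex == "male" and p.age == 50 and 10 <= p.pack_years <= 15


def refined_cohort(p, score_threshold=0.5, velocity_threshold=0.05):
    """Refined cohort: the initial criteria plus biomarker score and velocity."""
    return (
        initial_cohort(p)
        and p.composite_score > score_threshold
        and p.bm_velocity >= velocity_threshold
    )


population = [
    Patient("male", 50, 12, 0.7, 0.08),
    Patient("male", 50, 14, 0.3, 0.09),
    Patient("male", 50, 11, 0.8, 0.01),
    Patient("female", 50, 12, 0.9, 0.10),
    Patient("male", 50, 20, 0.9, 0.10),
]
initial = [p for p in population if initial_cohort(p)]
refined = [p for p in population if refined_cohort(p)]
```

Each added criterion narrows the cohort, so the refined cohort is always a subset of the initial one; in practice the cohort size would need to remain large enough to support meaningful statistics.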

Master neural net NN12 may also generate various types of information as a result of analyzing the various types of input data that have been provided. At operation 470, neural net NN12 determines, for an individual patient, an increased risk (e.g., a percentage, a multiplier, or any other numeric value, etc.) for having cancer relative to a population, e.g., a cohort population. A report may be provided that includes the determined risk and information used to determine the risk, e.g., the cohort population, the size of the cohort, etc., as well as relevant statistics (e.g., margin of error). The report may also include a recommendation that high risk patients undergo more frequent screening. In some aspects, the recommended time between follow-ups is a function of clinical indicators and the cohort population. Recommendations as to behavioral changes may also be provided.
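As one hedged illustration of how a risk multiplier and a margin of error might be reported together, the sketch below compares the observed cancer rate in a cohort with a baseline rate and attaches a normal-approximation margin of error. The function name `cohort_risk`, the 95% z-value, and all numbers are assumptions; the embodiments do not prescribe this calculation.

```python
import math


def cohort_risk(cases, cohort_size, baseline_rate, z=1.96):
    """Return (multiplier, cohort_rate, margin_of_error).

    multiplier      -- cohort cancer rate divided by the baseline rate
    margin_of_error -- half-width of a normal-approximation interval on the
                       cohort rate (z=1.96 corresponds to roughly 95%)
    """
    if cohort_size <= 0 or baseline_rate <= 0:
        raise ValueError("cohort_size and baseline_rate must be positive")
    rate = cases / cohort_size
    margin = z * math.sqrt(rate * (1 - rate) / cohort_size)
    return rate / baseline_rate, rate, margin


# Hypothetical numbers: 30 confirmed cases in a cohort of 1,000 patients,
# against a 1% rate in the wider population.
multiplier, rate, margin = cohort_risk(cases=30, cohort_size=1000, baseline_rate=0.01)
```

The margin of error shrinks as the cohort grows, which is why the size of the cohort is useful context alongside the reported risk.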

Other types of information may be provided to a patient or physician as well. For example, at operation 474, major risk factors for having cancer based upon the analysis by neural net NN12 may be reported. At operation 472, cancer-specific biomarkers that have been optimized (e.g., most heavily weighted in the risk determination) may be reported. At operation 476, a summary of data used to generate the predicted risk of cancer may be reported. At operation 478, physicians may be ranked according to their ability to diagnose early stage cancer. The techniques used by these physicians may be evaluated to develop best practices for training other physicians in the early diagnosis of cancer. At operation 480, an optimal BM velocity, which is a cutoff between velocities that are not associated with an increased risk of having cancer and velocities that are associated with an increased risk of having cancer (e.g., a threshold, etc.), may be reported.
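The manner in which an optimal BM velocity cutoff is selected is not specified above. One conventional possibility, shown here only as an assumption, is to choose the cutoff that maximizes Youden's J statistic (sensitivity plus specificity minus one) over retrospective data. The function name and sample data are hypothetical.

```python
def best_cutoff(velocities, has_cancer):
    """Return (cutoff, j) maximizing Youden's J = sensitivity + specificity - 1.

    A patient is treated as flagged when velocity >= cutoff. Ties keep the
    lowest cutoff that reaches the maximum.
    """
    positives = sum(has_cancer)
    negatives = len(has_cancer) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both diagnosed and undiagnosed patients")
    best = None
    for cutoff in sorted(set(velocities)):
        tp = sum(1 for v, y in zip(velocities, has_cancer) if y and v >= cutoff)
        tn = sum(1 for v, y in zip(velocities, has_cancer) if not y and v < cutoff)
        j = tp / positives + tn / negatives - 1
        if best is None or j > best[1]:
            best = (cutoff, j)
    return best


# Hypothetical velocities with known diagnoses.
velocities = [0.01, 0.02, 0.03, 0.08, 0.09, 0.12]
has_cancer = [False, False, False, True, False, True]
cutoff, j = best_cutoff(velocities, has_cancer)
```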

At operation 482, patient information regarding whether cancer was diagnosed during a follow-up visit may be written back to the EMRs, in order to provide continuous feedback to the system.

As neural net NN12 receives data validating or invalidating whether an individual identified as high risk (as predicted by the neural net) has cancer, neural net NN12 may continue to intrinsically train as a function of time, in production mode, adjusting input and/or hidden layer weights as additional patient data becomes available. Accordingly, by utilizing a feedback loop, in which the difference between predicted results and the actual results, e.g., confirmed by invasive testing, is fed back into the system as a function of time, the accuracy of prediction may be improved as additional data is fed into the system.
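The feedback loop described above may be sketched, under simplifying assumptions, as a single gradient-descent step on a logistic model each time a confirmed outcome arrives. A one-layer model stands in for NN12 here, and the learning rate, feature values, and function names are hypothetical.

```python
import math


def predict(weights, bias, features):
    """Predicted probability of cancer from a one-layer logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def feedback_update(weights, bias, features, confirmed, lr=0.1):
    """One gradient-descent step using the gap between prediction and outcome.

    `confirmed` is 1 if cancer was confirmed (e.g., by invasive testing), else 0.
    """
    error = predict(weights, bias, features) - confirmed
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Hypothetical patient with two scaled inputs; cancer is later confirmed.
weights, bias = [0.0, 0.0], 0.0
features = [1.0, 2.0]
before = predict(weights, bias, features)
weights, bias = feedback_update(weights, bias, features, confirmed=1)
after = predict(weights, bias, features)
```

After the update, the predicted probability for the same inputs moves toward the confirmed outcome, which is the behavior the feedback loop relies upon.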

The embodiments herein may automatically and continuously update the risk scores and the corresponding confidence values/margins of error based on evolving data (e.g., medical patient data) in order to provide the highest confidence answers and recommendations. Rather than providing static calculations that always provide the same answers when given the same input, the embodiments herein continually update as new data is received, thereby providing the physician and patient with the best, most up-to-date information.

Thus, the embodiments herein provide substantial advantages over systems that generate static results based on preset, fixed criteria that are rarely revised (or revised only at periodic updates, e.g., software updates). By acting dynamically, risk scores and recommendations can change based on evolving demographic changes, evolving medical discoveries, etc., as well as new data within the EMR and publicly available databases. Therefore, the embodiments herein can continuously improve early detection of cancer as new data becomes available, providing physicians and their patients with an automated system for accessing the best medical practices and treatments for their patients as medical advances and demographics change over time.

FIG. 5 shows a flow diagram of example operations for EMR Extractor neural net NN3, according to embodiments of the invention. Clean data KS 80 comprises a repository of clean information from EMR db 10 and, as applicable, EMR db 20. At operation 505, neural net NN3 is utilized to extract data from clean data KS 80. This extracted data may be stored in extracted data KS 130. At operation 510, the extracted data is separated by type, e.g., raw images 155, biomarker (BM) velocity data 145, text-based unstructured data 140, and numeric/structured data 150. At operation 515, it is determined whether additional processing (by other neural nets) is needed before providing the information to the master neural net NN12 for analysis. Numeric data 150 may be stored in patient data KS 135 without additional processing. In this example, the remaining types of data are processed with other neural nets. Raw image data 155 is provided to neural net NN10, which analyzes imaging data, at operation 520. Biomarker velocity data 145 is provided to the biomarker velocity neural net NN9, which identifies patterns in biomarker data, at operation 530. In some embodiments, NN9 may be untrained.

Unstructured data 140 is provided to natural language processing neural net NN8, at operation 540, which uses natural language processing and semantics to analyze unstructured data. The NLP may be applied to analyze the context of various types of text (e.g., physician notes, lab reports, medical history, prescribed treatment, and any other type of annotation) to determine relevant risk factors, and this information may be provided as inputs into master NN12. NN8 may also derive numeric inputs from the unstructured language, e.g., years of smoking, years of family members smoking, and any other numeric data at operation 540. For example, neural net NN8 may be employed for natural language processing of a written radiology report that accompanies a raw image. With a sufficiently large number of training examples, an NLP/deep learning program will learn how to interpret a written report relevant to a finding of cancer. In this example, neural net NN8 generates at least two outputs, e.g., text-based data 175, which comprises patient histories, image report impressions, etc., as well as converted numeric fields 185, e.g., years of smoking, frequency of smoking, etc. Pt data KS 135 may store data sent to the bus 142 for subsequent input into the master neural net NN12.
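The separation of extracted data by type and its routing to downstream processing, as described for FIG. 5, may be sketched as a simple dispatch table. The type labels and record identifiers are hypothetical; the destinations follow the description above.

```python
# Hypothetical type labels mapped to the destinations described for FIG. 5.
ROUTES = {
    "raw_image": "NN10",
    "bm_velocity": "NN9",
    "unstructured_text": "NN8",
    "numeric": "patient data KS 135",  # stored without additional processing
}


def route(records):
    """Group extracted record ids by destination according to data type."""
    routed = {}
    for record in records:
        destination = ROUTES[record["type"]]
        routed.setdefault(destination, []).append(record["id"])
    return routed


extracted = [
    {"id": "r1", "type": "numeric"},
    {"id": "r2", "type": "raw_image"},
    {"id": "r3", "type": "unstructured_text"},
    {"id": "r4", "type": "bm_velocity"},
    {"id": "r5", "type": "numeric"},
]
routed = route(extracted)
```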

FIG. 6 shows a flow diagram of example operations for neural nets associated with publicly available data, according to embodiments of the invention. At operation 610, neural net NN4 is utilized to identify information in the EMR which would benefit from the additional knowledge obtainable from publicly available sources of information. Corresponding questions may be generated, e.g., by a question-answer module, as known in the art, and stored in public KS 110 for future retrieval. At operation 620, the best in class domain specific knowledge sources are identified and maintained. In this example, domain refers to a type of publicly available information, e.g., geographic/environmental, employment, population, or genetic database. At operation 630, neural nets NN5 a-d are utilized to query each respective domain source, provided that neural net NN4 has identified a need for that specific domain information. At operation 640, it is determined whether data has been extracted from all domain sources and fully evaluated. If not, the process returns to operation 620, and identification of best in class domain specific knowledge sources is repeated. In some embodiments, provided that questions have been asked regarding the genetic domain, at operation 645, neural net NN7 is utilized to extract details of relevant genetic defects. The genetic data may be provided to master neural net NN12 via genetic data 165. At operation 650, neural net NN11 is utilized to extract population data for cohort analysis, and the extracted population/cohort data is provided to neural net NN12 for analysis. At operation 655, neural net NN6 a-d is utilized to scale (or weight) the answers provided in each respective domain. It is understood that weights in one domain may not be equivalent to weights in another domain, e.g., a ‘9’ in the environmental domain may not be equivalent to a ‘9’ in the genetic domain. At operation 660, scaled data is loaded from the dbs 30-60 onto the scaled bus 70. The scaled data may be stored in scaled KS 120 for future use.
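To illustrate why a raw value in one domain may not be equivalent to the same raw value in another, the sketch below maps each domain's raw answers onto a common 0-to-1 scale using per-domain ranges. The ranges are invented for illustration, and simple min-max scaling is used in place of the scaling neural nets NN6 a-d.

```python
# Hypothetical raw value ranges for two domains (not taken from the embodiments).
DOMAIN_RANGES = {
    "environmental": (0, 10),
    "genetic": (0, 100),
}


def scale(domain, raw):
    """Min-max scale a raw domain answer onto a common 0-to-1 scale."""
    lo, hi = DOMAIN_RANGES[domain]
    if not lo <= raw <= hi:
        raise ValueError(f"{raw} is outside the {domain} range")
    return (raw - lo) / (hi - lo)


env_nine = scale("environmental", 9)  # a '9' near the top of its range
gen_nine = scale("genetic", 9)        # a '9' near the bottom of its range
```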

In some embodiments, as new data becomes available for a patient, the system recomputes the risk score and provides the result to the physician.

In many domains, the answer with the highest confidence need not be the appropriate answer because there can be several possible explanations for a problem.

As will be appreciated by one skilled in the art, aspects of the embodiments herein may be embodied as a system, method or computer program product. Accordingly, aspects of the embodiments herein may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the embodiments herein may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

FIGS. 11 and 12 are flow diagrams of example processes for utilizing a machine learning system to classify an individual patient into a risk category, e.g., based upon a risk score. FIG. 11 involves constructing a cohort population, while FIG. 12 involves classification of an individual patient.

Referring to FIG. 11, at operation 2005, biomarker values and a medical history are received for an individual patient (e.g., at neural net NN12). At operation 2010, a machine learning system (e.g., neural net NN11) is used to identify a cohort population relative to the individual patient, based upon information (e.g., biomarker values, medical history, positive or negative diagnosis, etc.) from a large volume of patients (e.g., from population db 50). By providing biomarker values and the medical history of the individual patient to neural net NN11, the neural net can determine a cohort population.

At operation 2020, a machine learning system may be used to identify parameters (e.g., risk factors, corresponding weightings, etc.) to divide the cohort population into a plurality of categories, each category representing a level of risk of having a disease.

The machine learning system may not know, a priori, which parameters (e.g., risk factors) are most predictive of having lung cancer. Accordingly, the neural net may determine these parameters using an iterative process, until specified criteria are met (e.g., having a specified percentage of a population of individuals that have been diagnosed as having cancer classified within the highest risk category). The neural net may refine the parameters (e.g., risk factors, weightings, etc.) until meeting specified criteria.

In some aspects, neural net NN11 may perform clustering (e.g., using statistical clustering techniques, etc.) on the cohort population to identify risk factors, e.g., based on medical information from the large volume of patients. For example, by performing clustering on age, the neural net NN11 may determine that individuals between 45 and 50 years of age are most likely to have cancer (e.g., first diagnosis). Other parameters may be selected in a similar manner. Accordingly, the machine learning system may select an initial set of parameters, e.g., an age/age range, a smoking history (in terms of years and/or packs per year) for analysis, and assign an initial weighting for each parameter. Accordingly, by using clustering or other grouping/analytical techniques, predictive parameters may be identified.
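As a simplified illustration of identifying a predictive age range, the sketch below groups patients into fixed-width age bands and selects the band with the highest proportion of diagnosed patients. Fixed-width binning is used here as a stand-in for the statistical clustering mentioned above, and the band width and records are hypothetical.

```python
from collections import defaultdict


def highest_risk_age_band(records, width=5):
    """Return (start_age, end_age) of the band with the highest cancer rate.

    `records` is a list of (age, has_cancer) pairs. Bands are fixed-width,
    e.g., 45-49 for width=5; this is binning, not true clustering.
    """
    counts = defaultdict(lambda: [0, 0])  # band start -> [cases, total]
    for age, has_cancer in records:
        start = (age // width) * width
        counts[start][0] += int(has_cancer)
        counts[start][1] += 1
    start = max(counts, key=lambda s: counts[s][0] / counts[s][1])
    return start, start + width - 1


# Hypothetical (age, diagnosed) records.
records = [
    (41, False), (43, False),
    (46, True), (47, True), (49, False),
    (52, False), (53, True), (54, False),
]
band = highest_risk_age_band(records)
```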

At operation 2025, patients (e.g., in some aspects, each patient of the large volume of patients) are classified into a category of the cohort population based on the risk score. At operation 2040, it is determined whether the classification of the patients meets specified criteria by comparing with known classifications of the patients. As the information from the large volume of patients includes a diagnosis of having or not having cancer, the classifications/risk scoring by the neural net may be evaluated for accuracy. For example, a majority of patients that do have cancer should have a high risk score and be classified as high risk, while a majority of patients that do not have cancer should have a low risk score and be categorized as low risk.
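The comparison of risk classifications against known diagnoses may be summarized, for example, by sensitivity and specificity. The following sketch assumes a two-category simplification (high risk versus not high risk); the function name and sample data are hypothetical.

```python
def evaluate(predicted_high_risk, has_cancer):
    """Return (sensitivity, specificity) of high-risk classification.

    sensitivity -- fraction of patients with cancer classified as high risk
    specificity -- fraction of patients without cancer not classified as high risk
    """
    pairs = list(zip(predicted_high_risk, has_cancer))
    tp = sum(1 for p, y in pairs if p and y)
    fn = sum(1 for p, y in pairs if not p and y)
    tn = sum(1 for p, y in pairs if not p and not y)
    fp = sum(1 for p, y in pairs if p and not y)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need patients both with and without cancer")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical classifications and known diagnoses for six patients.
predicted = [True, True, False, False, True, False]
actual = [True, True, True, False, False, False]
sensitivity, specificity = evaluate(predicted, actual)
```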

At operation 2050, if the classification (by risk score) meets specified criteria (e.g., within a specified error rate, margin of error, confidence interval, etc.), then the process may proceed to block “A” in FIG. 12. Otherwise, at operation 2070, the machine learning system will select a revised set of parameters (e.g., the revised parameters may include new fields of medical information, altered weighting for each field, etc.) to construct a risk score for classification. For example, if age and smoking history were originally used, a revised set of parameters may be constructed using age, smoking history, and biomarker values. As another example, if age and smoking history were originally used to determine a risk score, a revised set of parameters may be constructed using a decreased weighting for age, and an increased weighting for smoking history.

At operation 2080, categories of the cohort population are constructed using the revised set of parameters, and the process continues to operation 2025. Operations 2025-2080 may be repeated until reaching specified criteria.
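The iteration through operations 2025-2080 may be sketched as a loop that tries successive parameter sets until the classification agrees sufficiently with known diagnoses. In this sketch the candidate parameter sets are supplied in advance, whereas the embodiments contemplate the machine learning system selecting them; the weighted-sum risk score, the cutoff, the accuracy criterion, and all data are assumptions.

```python
def risk_score(patient, weights):
    """Weighted sum of the patient's (pre-scaled, 0-to-1) risk factor values."""
    return sum(weight * patient[field] for field, weight in weights.items())


def refine(patients, candidates, cutoff=0.5, min_accuracy=0.9):
    """Return the first (weights, accuracy) meeting the accuracy criterion."""
    for weights in candidates:
        correct = sum(
            (risk_score(p, weights) >= cutoff) == p["cancer"] for p in patients
        )
        accuracy = correct / len(patients)
        if accuracy >= min_accuracy:
            return weights, accuracy
    return None, None  # no candidate met the criterion


# Hypothetical retrospective patients with known diagnoses.
patients = [
    {"age": 0.8, "smoking": 0.2, "biomarker": 0.1, "cancer": False},
    {"age": 0.3, "smoking": 0.9, "biomarker": 0.9, "cancer": True},
    {"age": 0.7, "smoking": 0.8, "biomarker": 0.8, "cancer": True},
    {"age": 0.2, "smoking": 0.1, "biomarker": 0.2, "cancer": False},
]
candidates = [
    {"age": 0.7, "smoking": 0.3},                    # initial parameters
    {"age": 0.2, "smoking": 0.4, "biomarker": 0.4},  # revised: new field, new weights
]
weights, accuracy = refine(patients, candidates)
```

Here the initial parameters misclassify half of the patients, so the loop moves on to the revised set, which adds a biomarker field and decreases the weighting for age.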

Referring to FIG. 12, at operation 2110, the machine learning system is utilized to classify (via a risk score) the individual patient into a category of the cohort population (high risk, medium risk, low risk). At operation 2120, additional medical information is received for the individual patient, indicating whether the individual patient has the disease (e.g., cancer). At operation 2130, a determination is made as to whether the classification of the individual patient is consistent with the additional medical information (e.g., the diagnosis of whether or not the patient has cancer). If, at operation 2140, the classification is consistent with the additional medical information, then the process may end. Otherwise, if the results are not consistent, the machine learning system selects a revised set of parameters (e.g., the parameters may include new fields of medical information, altered weighting for each field, etc.) for the cohort population at operation 2150. For example, a new field could be added to select a new cohort (e.g., a new biomarker) or the weights of the inputs into the neural net NN11 may be adjusted. At operation 2160, categories of the cohort population are constructed based upon the revised set of parameters (by assigning a corresponding risk score), the individual patient may be classified into a category of the cohort population, and the process iterates through operations 2130-2160 until reaching agreement.

Thus, neural networks are adaptive systems. Through a process of learning by example, rather than conventional programming by different cases, neural networks are able to evolve in response to new data. It is also noted that algorithms for training artificial neural networks (e.g., gradient descent, cost functions, etc.) are known in the art and will not be covered in detail herein.

Computer program code for carrying out operations for aspects of the embodiments herein may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the embodiments herein are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments herein. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It is understood in advance that although this disclosure includes a detailed description on cloud computing, implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments herein are capable of being implemented in conjunction with any other type of computing environment now known or later developed. Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.

Characteristics are as follows:

On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.

Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).

Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).

Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.

Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models are as follows:

Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.

Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models are as follows:

Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.

Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.

Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.

Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure comprising a network of interconnected nodes.

Referring now to FIG. 7, an example of a computing environment that includes a computing node for an artificial intelligence system is shown. In some embodiments, the node may be a stand-alone (single) computing node. In some embodiments, the node may be implemented in a cloud-based computing environment. In other embodiments, the node may be one of a plurality of nodes in a distributed computing environment. Accordingly, computing node 740 is only one example of a suitable artificial intelligence computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein.

Regardless, computing node 740 is capable of being implemented and/or performing any of the functionality set forth hereinabove. In cloud computing node 740 there is a computer server/node 740, which is operational with numerous other computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with server/node 740 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.

Computer server/node 740 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Server/node 740 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

FIG. 7 shows an example computing environment according to embodiments of the invention. The components of server/node 740 may include, but are not limited to, one or more processors or processing units 744, a system memory 748, a network interface card 742, and a bus 746 that couples various system components including system memory 748 to processor 744. Bus 746 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus. Computer server/node 740 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer server/node 740, and it includes both volatile and non-volatile media, removable and non-removable media.

System memory 748 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 750 and/or cache memory 755. Computer server/node 740 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 760 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive” or solid state drive). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 746 by one or more data media interfaces. As will be further depicted and described below, memory 748 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention. Program/utility 770, having a set (at least one) of program modules corresponding to one or more elements of NACS 100, may be stored in memory 748 by way of example, and not limitation, as well as an operating system 780, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules for NACS 100 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.

Computer server/node 740 may also communicate with a client device 710. Client device 710 may have one or more user interfaces 718 such as a keyboard, a pointing device, a display, etc., one or more processors 714, and/or any devices (e.g., network card 712, modem, etc.) that enable client device 710 to communicate with computer server/node 740. Still yet, computer server/node 740 can communicate with client 710 over one or more networks 725 such as a local area network (LAN), a wide area network (WAN), and/or a public network (e.g., the Internet) via network interface card 742. As depicted, network interface card 742 communicates with the other components of computer server/node 740 via bus 746. It should be understood that although not shown, other hardware and/or software components can be used in conjunction with computer server/node 740. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems. One or more databases 730 may store data accessible by NACS 100.

In some embodiments, NACS 100 may run on a single server node 740. In other embodiments, NACS 100 may be distributed across a plurality of nodes, wherein a master computing node provides workloads to a plurality of slave nodes (not shown).

Referring now to FIG. 8, illustrative cloud computing environment 800 is depicted. As shown, cloud computing environment 800 comprises one or more cloud computing nodes 805 with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone 810, desktop computer 815, and laptop computer 820, may communicate. Nodes 805 may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof. This allows cloud computing environment 800 to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices 810-820 shown in FIG. 8 are intended to be illustrative only and that computing nodes 805 and cloud computing environment 800 can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).

Referring now to FIG. 9, a set of functional abstraction layers provided by cloud computing environment 800 (FIG. 8) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 9 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided: Hardware and software layer 910 includes hardware and software components. Examples of hardware components include mainframes; RISC (Reduced Instruction Set Computer) architecture based servers; storage devices; networks and networking components. Examples of software components include network application server software; application server software; and database software. Virtualization layer 920 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers; virtual storage; virtual networks, including virtual private networks; virtual applications and operating systems; and virtual clients. In one example, management layer 930 may provide the functions described below. Resource provisioning provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Other functions provide cost tracking as resources are utilized within the cloud computing environment. In one example, these resources may comprise application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal provides access to the cloud computing environment for consumers and system administrators.

Workloads layer 940 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: data analytics processing; neural net analytics, etc.

Referring to FIG. 14, a flowchart is provided that describes generating an artificial neural net (ANN) to predict a likelihood of having cancer. At operation 2310, data preprocessing may occur (e.g., normalization of data). In some embodiments, the concentration values of each biomarker and clinical data may be pre-processed numerically before being provided as input into the ANN. For example, the values may be normalized to have a mean equal to 0 and a standard deviation of 1. The normalized data may be randomized before being provided as inputs into the ANN. At operation 2320, the data set is divided into training/test data and validation data, e.g., 70% for the training phase and 30% for the validation phase. At operation 2330, parameters are selected (e.g., number of hidden layers, number of nodes, inputs, outputs, transfer/activation functions, etc.) and the corresponding architecture is created for the system.
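By way of illustration only, the preprocessing and division of operations 2310 and 2320 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the function names zscore and shuffle_and_split, the random seed, and the toy input matrix are hypothetical, and the population standard deviation is assumed.

```python
import numpy as np

def zscore(X):
    """Scale each column (input variable) to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population standard deviation (assumption)
    std[std == 0] = 1.0          # leave constant columns unscaled
    return (X - mean) / std

def shuffle_and_split(X, y, train_fraction=0.7, seed=0):
    """Randomize the row order, then split into training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, val_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Hypothetical toy inputs: 10 patients, 3 input variables, binary outcome
X = np.arange(30, dtype=float).reshape(10, 3)
y = np.array([0, 1] * 5)

Xn = zscore(X)
X_train, y_train, X_val, y_val = shuffle_and_split(Xn, y)
```

With ten rows and a 70%/30% division, seven rows go to the training phase and three to the validation phase.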

At operation 2340, the training/test data is used to train the system and generate a classifier. The initial weights between each connection and the biases of the ANN are set at the beginning, e.g., in a randomized manner, and during training the weights are adjusted by a learning function. Criteria are selected to stop the training phase in an ANN, e.g., when the root-mean-square error is less than a threshold or when the correct classification rate meets a threshold. The values of the biomarkers and clinical data are directly involved in the modification of the connection weights in the ANN model during training. Methods for avoiding over-fitting are also applied.

Once the training process is complete, two operations are performed: (1) at operation 2345, the output error and classification rate are determined; and (2) at operation 2350, the sensitivity and specificity are determined. If the sensitivity and specificity meet desired performance criteria (e.g., a threshold such as at least 70% sensitivity at 80% specificity), the training ceases, at operation 2360. On the other hand, if the performance criteria are not met, then the parameters are adjusted at operation 2330, and the classifier is retrained using the adjusted parameters at operation 2340.

Provided that the sensitivity and specificity performance criteria are met (e.g., threshold(s)), the neural net is saved at operation 2370. In some embodiments, multiple neural nets may meet specified criteria and may be saved, with the best performing neural net and its associated parameters subsequently selected, e.g., for use in a clinical setting.

In some embodiments, the optimal ANN architecture is chosen based on the mean squared training error and the best classification percentage. To determine which ANN architecture is most appropriate for the dataset of biomarker concentrations and clinical parameters, various ANNs with distinct configurations may be tested, including with one hidden layer (with 1, 2, 3 nodes, etc.), two hidden layers (with different combinations, e.g., 3-2, 5-3, 2-6, etc. nodes) or three hidden layers. Only the ANNs that present the best ability to correctly classify the largest possible number of data are chosen and saved. The neural net is then used to classify the data from the validation phase, and the prediction power, sensitivity and specificity are determined (see operations 2380-2390).

Once the sensitivity and specificity meet the desired performance criteria at operation 2390, the neural net is selected for prediction of cancer at operation 2395. This version may be static, semi-static or continuously updated.

In some embodiments, this configuration may be made static, meaning that the neural net is not refined based on collection of additional data, and is deployed, e.g., to a doctor's office, for use in determining a likelihood of cancer in a patient. In other embodiments, the neural net is continually refined based on collection of additional data, and when deployed, e.g., to a doctor's office or a remote server for use in estimating a likelihood of cancer in a patient, the model is continuously updated as more data becomes available. In still other embodiments, this configuration may be periodically updated, e.g., according to a prescribed schedule.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting with respect to a particular embodiment of the present invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the embodiments herein has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed herein. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

In a further exemplary embodiment, the decision-support application described herein is applied to the early detection of cancer. In one aspect, the decision-support application utilizes data from blood biomarkers, patient medical records, epidemiological factors associated with increased or decreased lung cancer risk gathered from the medical literature, clinical factors associated with increased or decreased lung cancer risk gathered from the medical literature, and analyses of patient x-rays and other images generated by various scanning techniques well known in the art, in concert with information gathered from the question-answering system, in order to determine a patient's cancer risk relative to an appropriate matched cohort. In a further aspect, this determination is improved over time utilizing machine learning to improve the algorithm based upon prior results.

In a further aspect, the medical images include, but are not limited to, x-ray based techniques (conventional x-rays, computed tomography (CT), mammography, and use of contrast agents), molecular imaging using a variety of radiopharmaceuticals to visualize biological processes, magnetic resonance imaging (MRI) and ultrasound.

In a further aspect, the NACS 100 described herein provides a patient's lung cancer risk as well as an assessment of the likelihood of other non-cancer lung diseases. For example, the application may assess the likelihood of COPD, asthma, or other disorders. In a further aspect, the application described herein may provide an assessment of a patient's risk of multiple cancers simultaneously. In a further aspect, the application may also provide a list of potential tests that may increase the confidence value for each assessed risk, and that may increase or decrease the assessed risk as a result of the new data.

In a further aspect, the clinical and epidemiological factors that may be analyzed to assess a patient's relative risk of lung cancer include, but are not limited to, disease symptoms like persistent cough, bloody cough or unexpected weight loss, radiological results like suspicious findings from chest x-rays or CT scans, and environmental factors like amount of exposure to air pollution, radon, asbestos, or second-hand smoke, history of smoking both in terms of time and intensity of use, and family history of lung cancer.

In a further exemplary embodiment, the machine learning application described herein provides results in a secured, cloud-based physician portal.

One of skill in the art recognizes that the embodiments disclosed herein may be practiced with any advanced application capable of machine learning and natural language processing.

All references cited herein are incorporated by reference in their entirety.

EXAMPLES

The Examples below are given so as to illustrate the practice of one or more embodiments of the invention, and are intended to be non-limiting with regard to the embodiments presented herein.

Example 1 Training Neural Analysis of Cancer System (NACS) with a Large Dataset

Biomarker data from tens of thousands of patients (about 41,000 participants) was collected in a study from Taiwan (Wen, Y. H., “Cancer screening through a multi-analyte serum biomarker panel during health check-up examinations: Results from a 12-year experience” Clinica Chimica Acta 450 (2015) 273-276). Tumor markers AFP, CA 15-3, CA125, PSA, SCC, CEA were assayed using kits from Abbott Diagnostics. Tumor markers CYFRA 21-1 and CA 19-9 were assayed using kits from Roche Diagnostics. Tumor marker CEA was assayed using kits from Siemens Healthcare. This data set may be used as a training data set for NACS 100. This patient data, coupled with comparable data from one or more other jurisdictions for geographic and genetic diversity, is stored in one or more electronic medical records databases (e.g., EMR db 10) along with the clinical outcomes, i.e., whether cancer was detected within about one year of biomarker testing and, if so, the type of cancer. Training data from Wen et al. will be particularly useful for pan-cancer screening (i.e., testing of asymptomatic patients for an array of tumor types including pancreas, liver, and prostate).

Biomarker data from thousands of patients (about 3,000 patients) was also collected in a study from Barcelona, Spain (Molina, R., “Assessment of a Combined Panel of Six Serum Tumor Markers for Lung Cancer” Am. J. Respir. Crit. Care Med. (2015)). In this study, tumor markers CEA, CA 15.3, CYFRA 21.1 and NSE were assayed using kits from Roche, and SCC and ProGRP were assayed using kits from Abbott Diagnostics. This data set may also be used as a training data set for NACS 100.

This patient data, coupled with comparable data from one or more other jurisdictions for geographic and genetic diversity, is stored in one or more electronic medical records databases (e.g., EMR db 10) along with the clinical outcomes, i.e., whether lung cancer was detected within about one year of biomarker testing. Training data from Molina et al. will be particularly useful for aiding in lung cancer diagnosis when patients have vague or ambiguous signs or symptoms of lung cancer (e.g., cough, chest pain, etc.).

Corresponding patient medical information/history may also be stored in EMR db 10, such that for each patient participating in the study, one or more of the following types of data or parameters are also present: age, smoking history, gender, family history (e.g., whether a first degree relative has been diagnosed with cancer before age 50, etc.) and symptoms (e.g., unexplained weight loss, fatigue, persistent cough, abdominal pain, chest pain, etc.). Typically, data from a large volume of patients is needed for sufficient training of NACS 100.

The Neural Analysis of Cancer System (NACS) 100 may access this data, using for example neural nets NN2 a and NN2 b, to determine whether the data is clean, e.g., whether there is any missing, problematic or conflicting data. Missing data may be ranked according to potential impact on the risk score, and data of high impact is corrected. NACS 100 makes a determination as to whether sufficient information is available to determine a risk score, and if so, the system proceeds with analysis of the data. Data is anonymized, as needed.

Once the data is sufficiently clean, neural net NN3 extracts the data and separates the data according to data type. In some embodiments, data may be separated into unstructured data 140 (e.g., text-based data including physician notes, etc.) and clinical and numeric data 150 (e.g., symptoms, age, gender, smoking history, family history, etc.). In the event that imaging data 155 and biomarker velocity 145 information are present, these two types of data may be separated as well. Clinical and numeric data 150 is provided to master neural net NN12; biomarker velocity 145 is provided to neural net NN9 for analysis; unstructured data 140 is provided to neural net NN8 for analysis; and imaging data 155 is provided to NN10 for imaging analysis. The outputs of the neural nets NN8, NN9 and NN10 are provided to NN12 for analysis.

Neural net NN11 analyzes the dataset to determine parameters for constructing a cohort population. Various statistical techniques, e.g., clustering, etc., may be used as part of this analysis. In some embodiments, NACS 100 may determine a cohort population based upon one or more inputs provided, e.g., an age or age range, smoking history, gender, etc. NACS 100, e.g., master neural net NN12, analyzes the various inputs, including the clinical and numeric data (including biomarker data), unstructured data, imaging and biomarker velocity data as available, and generates risk categories corresponding to a level of risk (based upon a risk score) for developing cancer. These risk categories can be used to determine a level of risk for individual patients, as set forth in the example below.

Example 2 Using NACS to Determine a Risk of the Presence of Lung Cancer

Data from an individual patient may be collected, e.g., via a web application form, such as the example form provided in Table A. Patient information including clinical/numeric demographic data, imaging diagnostics and corresponding text notes as well as biomarker data may be collected via the web application and stored in an electronic records db.

TABLE A
FIELD | NOTES | TYPE
Section: Specimen
Clinical Collection Site | | Text field
Patient ID | | Letters + numbers
Sample ID | | Letters + numbers
Serum Sample Collection Date | | Numbers
Section: Patient Information
Patient Age | | Numbers
Gender | Choose from Male/Female | Drop-down
Ethnicity | Choose from Asian, African, Caucasian | Drop-down
Smoking Status | Choose from Current smoker/Former smoker | Drop-down
Cigarettes/Day | IF CURRENT: Number | Numbers
Smoking Duration | IF CURRENT: Number | Numbers
Age Quit | If former: Number | Numbers
Years since quitting | If former: Number | Numbers
Family History of lung cancer | YES/NO | Drop-down
Symptoms | YES/NO | Drop-down
List symptoms | If YES; free text field | Text field
Concomitant illness at the time of blood draw | YES/NO | Drop-down
Lung diseases | If YES | Text field
Other diseases | If YES | Text field
Concomitant medication at the time of blood draw | YES/NO | Drop-down
List | | Text field
Section: Clinical Diagnosis
Imaging diagnostics performed | YES/NO | Drop-down
Name of the test | If YES, choose from CT, LDCT, X-ray, Other | Drop-down
Date | If YES | Numbers
Nodules | YES/NO | Drop-down
Size of nodules | | Numbers
Number of nodules | | Numbers
Nodules characteristics - Margins regular | YES/NO | Drop-down
Nodules characteristics - Ground glass appearance | YES/NO | Drop-down
Nodules characteristics - Calcifications | YES/NO | Drop-down
Other benign disease | YES/NO | Drop-down
Name | If YES, text field | Text field
Invasive procedure performed | YES/NO | Drop-down
Name of the procedure | If YES, choose from biopsy, VATS, open chest surgery, other | Drop-down
Date of surgery | | Numbers
Lung Cancer | YES/NO | Drop-down
TNM | | Letters + numbers
AJCC/UICC Stage Group | | Letters + numbers
Histological Subtype | | Text field
Metastasis present | YES/NO | Drop-down
List sites of metastasis | | Text field
Other benign disease | YES/NO | Drop-down
Name | If YES | Text field
Section: Clinical Test Results - BIOMARKER A
Instrument | Choose from ARCHITECT or Elecsys | Drop-down
Test Name | | Text field
Sample receiving date | | Numbers
Test Date | | Numbers
Test Units | | Letters and symbols
Value | | Numbers

Based upon the information collected from this form, NACS 100 can analyze this data, determine a cohort population (from the training data set), construct categories of risk, and generate a corresponding risk score for the patient. Based upon which category the patient is classified into, from the risk score, a likelihood of having cancer can be calculated.

Thus, as an output, a report may be generated by NACS 100 indicating an individual patient's risk with respect to a patient cohort. The risk may be reported as a percentage, a multiplier or any equivalent. The report may also list a margin of error, e.g., a 72% chance plus or minus 10%.

Generally, the report will list the parameters used to construct the cohort population. For example, if NACS 100 determines that the parameters for the cohort are gender, age range, family history, and smoking history, then the report lists cohort parameters as, e.g., Male, Age 50-60, 10 year smoking history with 2 packs per day, relative (father) died at age 60 of lung cancer. It is understood that these cohort parameters are an example, and that many other sets of cohort parameters may be selected by NACS 100, e.g., based upon any combination of inputs into the system.

In some embodiments, a cohort size is provided, e.g., the cohort may be 525 individuals. Also, a list of genetic risk factors may be provided, e.g., mutations from genetic testing, e.g., [EGFR, KRAS], a family history, and biomarker scores [biomarker and corresponding concentration (if applicable), e.g., CYFRA 8 ng/mL, CA 15-3 45 U/mL].

Thus, biomarker data from an individual patient may be supplied to NACS 100, and NACS 100 may analyze the data (e.g., clinical and numeric data, symptoms, etc.) to output a report of a patient's predicted likelihood of having cancer.

Example 3 Training of the Artificial Neural Network (ANN)

There are many different types of ANNs that can be used to model or predict data where the correlation between dependent and independent variables is nonlinear or difficult to fit to an equation. For example, there are at least 25 different types of ANNs, wherein each type may provide different results based on different selected parameters, including but not limited to: training algorithms, activation/transfer functions, architectures (e.g., one, two, three or more hidden layers; one, two, three or more inputs as part of the input layer; one, two, three or more outputs as part of the output layer).

In this example, the flowchart of FIG. 14 was employed to train the artificial neural network. A feedforward pattern recognition network was selected as the specific type of neural network used to classify cancer patients and control subjects. The software used to design the ANN in this example was MATLAB™. However, any suitable software may be used.

To train the ANN, the biomarkers CA 19-9, CEA, CYFRA 21-1, NSE, Pro-GRP and SCC, and the clinical parameters smoking status, pack-years, patient age, family history of lung cancer, and “recoded has symptoms”, from 344 patients with newly diagnosed lung cancer and 105 subjects at high risk for developing lung cancer but with no history of lung disease, were used as inputs to the network. The symptoms variable (“recoded has symptoms”) frequently had a high rate of missing information, in some cases with more than 90% of patient data missing. In some embodiments, CEA, NSE, CYFRA 21-1 and CA 19-9 were tested by a Roche device and Pro-GRP and SCC were tested by an Abbott Architect i2000 device. Manufacturers' cutoffs for the 6 biomarkers were used, e.g., CEA>=5 ug/L; CYFRA 21-1>=3.3 ug/L; NSE>=25 ug/L; SCC>=2 ug/L; CA 19-9>=37 U/mL and Pro-GRP>=50 ng/L.
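By way of illustration only, the manufacturers' cutoffs listed above might be applied as a simple screen that flags a sample when any marker meets or exceeds its cutoff. The following is a minimal sketch under stated assumptions and is not the method used in this example: the names CUTOFFS, elevated_markers and any_marker_high and the sample values are hypothetical, and each value is assumed to be expressed in the unit given for its cutoff.

```python
# Cutoffs as stated in the text (units: ug/L, ug/L, ug/L, ug/L, U/mL, ng/L)
CUTOFFS = {
    "CEA": 5.0,
    "CYFRA 21-1": 3.3,
    "NSE": 25.0,
    "SCC": 2.0,
    "CA 19-9": 37.0,
    "Pro-GRP": 50.0,
}

def elevated_markers(values):
    """Return the names of markers whose value meets or exceeds the cutoff."""
    return [name for name, cutoff in CUTOFFS.items()
            if name in values and values[name] >= cutoff]

def any_marker_high(values):
    """True if at least one measured marker is at or above its cutoff."""
    return len(elevated_markers(values)) > 0

# Hypothetical sample values, for illustration only
sample_low = {"CEA": 2.0, "CYFRA 21-1": 1.0, "NSE": 11.0,
              "SCC": 1.0, "CA 19-9": 4.0, "Pro-GRP": 20.0}
sample_high = dict(sample_low, CEA=6.0)
```

Markers missing from a sample are skipped rather than treated as elevated, which is a design choice of this sketch.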

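An example of input data to a neural network is provided in the following Table 1.

TABLE 1
Smoking Status | Pack-yr | Patient Age At Exam | Family History Lung Cancer | Recoded has symptoms | CA 19-9 | CEA | CYFRA | NSE | Pro-GRP | SCC | Nodule Size
1 | 50 | 74 | 0 | 0 | 41 | 6 | 12 | 14 | 37 | 2 | 45
1 | 50 | 77 | 0 | 0 | 13 | 5 | 8 | 15 | 46 | 2 | 50
1 | 50 | 69 | 0 | 1 | 9 | 3 | 4 | 14 | 45 | 2 | 23
1 | 20 | 64 | 0 | 1 | 30 | 9 | 2 | 11 | 47 | 2 | 20
1 | 50 | 62 | 0 | 1 | 5 | 4 | 3 | 21 | 52 | 2 | 55
1 | 40 | 61 | 0 | 0 | 11 | 2 | 39 | 17 | 20 | 8 | 63
1 | 20 | 74 | 0 | 0 | 1 | 2 | 2 | 14 | 27 | 1 | 25
1 | 20 | 66 | 0 | 0 | 9 | 2 | 1 | 24 | 934 | 1 | 35
1 | 20 | 62 | 0 | 1 | 10 | 2 | 4 | 12 | 56 | 1 | 42
1 | 20 | 62 | 0 | 0 | 10 | 3 | 4 | 15 | 27 | 1 | 68
0 | 20 | 65 | 0 | 0 | 191 | 2 | 3 | 9 | 52 | 1 | 25
1 | 20 | 71 | 1 | 0 | 28 | 2 | 2 | 10 | 24 | 0 | 27
0 | 20 | 58 | 0 | 1 | 4 | 1 | 1 | 11 | 20 | 1 | 35
1 | 50 | 65 | 0 | 1 | 20 | 3 | 5 | 175 | 416 | 1 | 80
0 | 20 | 67 | 0 | 0 | 50 | 4 | 8 | 12 | 42 | 1 | 59
1 | 20 | 60 | 0 | 0 | 41 | 95 | 5 | 10 | 29 | 1 | 32
1 | 20 | 73 | 0 | 1 | 13 | 4 | 9 | 72 | 36 | 6 | 90
1 | 20 | 58 | 0 | 1 | 8 | 1 | 4 | 15 | 38 | 1 | 9
1 | 20 | 69 | 0 | 1 | 44 | 80 | 3 | 10 | 40 | 0 | 22
1 | 20 | 61 | 0 | 1 | 17 | 130 | 12 | 14 | 46 | 1 | 28
1 | 20 | 64 | 1 | 1 | 15 | 85 | 23 | 20 | 36 | 1 | 37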

Two outputs were selected: 1) those having a high probability of lung cancer and 2) those having a low probability of lung cancer (control subjects).

In some embodiments, the concentration value of each biomarker and the clinical data were pre-processed numerically before being used as inputs for the training of the ANN. The values were normalized to have a mean equal to 0 and a standard deviation of 1, e.g., using the function “mapstd”. Subsequently, the normalized data were randomized before being used as inputs for the ANN. The data set was divided using the “divideind” function as follows: 70% for the training phase and 30% for the validation phase.

For the input layer, the biomarkers and clinical data described above were used. For the hidden layers, a tangential activation function was used, e.g., a nonlinear tangential sigmoidal activation function. For the output layer, a linear activation function was used, ranging from 0 to 1, e.g., the linear “purelin” activation function. A Scaled Conjugate Gradient algorithm was used for training the ANN.
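By way of illustration only, a forward pass through a network with hyperbolic tangent hidden layers and a linear output layer might be sketched as follows. This is a minimal sketch and not the software used in this example: the names init_network and forward, the weight scale, the seed, and the layer sizes (11 inputs, hidden layers of 5 and 20 nodes, 2 outputs) are illustrative assumptions, and no training is performed.

```python
import numpy as np

def init_network(layer_sizes, seed=0):
    """Create small random weights and zero biases for each layer (assumption)."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, X):
    """Apply tanh on every hidden layer and a linear map on the output layer."""
    a = X
    for W, b in params[:-1]:
        a = np.tanh(a @ W + b)   # tangential sigmoid is equivalent to tanh
    W, b = params[-1]
    return a @ W + b             # linear output layer

# Illustrative sizes: 11 inputs, two hidden layers (5 and 20 nodes), 2 outputs
params = init_network([11, 5, 20, 2])
out = forward(params, np.zeros((3, 11)))
```

Each row of the output holds one value per output class; with all-zero inputs and zero biases every output is zero, which serves as a quick sanity check.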

Other algorithms may be used, including but not limited to: Levenberg-Marquardt (LM), BFGS Quasi-Newton (BFG), Resilient Backpropagation (RP), Conjugate Gradient with Powell/Beale Restarts (CGB), Fletcher-Powell Conjugate Gradient (CGF), Polak-Ribière Conjugate Gradient (CGP), One Step Secant (OSS) and Variable Learning Rate Backpropagation (GDX).

Optimal ANN architecture(s) were chosen based on the mean squared training error and the best classification percentage. It was determined which ANN architecture was the most appropriate for the dataset of biomarker concentrations and clinical parameters. In order to determine the best ANN, around 800 ANNs with distinct configurations were tested: with one hidden layer (e.g., 1, 2, 3 neurons, etc.), two hidden layers (e.g., different combinations such as 3-2, 5-3, 2-6, etc. nodes), and three hidden layers. The ANNs that presented the best ability to correctly classify the largest possible number of data were chosen and saved. Optimal architecture(s) were chosen as the one(s) having the lowest training error and the highest classification percentage. The following Table 2 shows examples of different configurations (number of hidden layers, and number of nodes per layer) that were tested for the neural net system.

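TABLE 2
1-hidden layer | 2-hidden layer | 3-hidden layer
1 | 1, 1 | 1, 1, 1
2 | 1, 2 | 1, 2, 1
3 | 1, 3 | 1, 3, 1
4 | 1, 4 | 1, 4, 1
5 | 1, 5 | 1, 5, 1
6 | 1, 6 | 1, 6, 1
7 | 1, 7 | 1, 7, 1
8 | 1, 8 | 1, 8, 1
9 | 1, 9 | 1, 9, 1
10 | 1, 10 | 1, 10, 1
11 | 1, 11 | 1, 1, 1
12 | 1, 12 | 1, 2, 1
13 | 1, 13 | 1, 3, 1
14 | 1, 14 | 1, 4, 1
15 | 1, 15 | 1, 5, 1
16 | 1, 16 | 1, 6, 1
17 | 1, 17 | 1, 7, 1
18 | 1, 18 | 1, 8, 1
19 | 1, 19 | 1, 9, 1
20 | 1, 20 | 1, 10, 1
21 | 2, 1 | 2, 1, 2
22 | 2, 2 | 2, 2, 2
23 | 2, 3 | 2, 3, 2
24 | 2, 4 | 2, 4, 2
25 | 2, 5 | 2, 5, 2
26 | 2, 6 | 2, 6, 2
27 | 2, 7 | 2, 7, 2
28 | 2, 8 | 2, 8, 2
29 | 2, 9 | 2, 9, 2
30 | 2, 10 | 2, 10, 2
31 | 2, 11 | 2, 1, 2
32 | 2, 12 | 2, 2, 2
33 | 2, 13 | 2, 3, 2
34 | 2, 14 | 2, 4, 2
35 | 2, 15 | 2, 5, 2
36 | 2, 16 | 2, 6, 2
37 | 2, 17 | 2, 7, 2
38 | 2, 18 | 2, 8, 2
39 | 2, 19 | 2, 9, 2
40 | 2, 20 | 2, 10, 2
41 | 3, 1 | 3, 1, 3
42 | 3, 2 | 3, 2, 3
43 | 3, 3 | 3, 3, 3
44 | 3, 4 | 3, 4, 3
45 | 3, 5 | 3, 5, 3
46 | 3, 6 | 3, 6, 3
47 | 3, 7 | 3, 7, 3
48 | 3, 8 | 3, 8, 3
49 | 3, 9 | 3, 9, 3
50 | 3, 10 | 3, 10, 3
51 | 3, 11 | 3, 1, 3
52 | 3, 12 | 3, 2, 3
53 | 3, 13 | 3, 3, 3
54 | 3, 14 | 3, 4, 3
55 | 3, 15 | 3, 5, 3
56 | 3, 16 | 3, 6, 3
57 | 3, 17 | 3, 7, 3
58 | 3, 18 | 3, 8, 3
59 | 3, 19 | 3, 9, 3
60 | 3, 20 | 3, 10, 3
61 | 4, 1 | 4, 1, 4
62 | 4, 2 | 4, 2, 4
63 | 4, 3 | 4, 3, 4
64 | 4, 4 | 4, 4, 4
65 | 4, 5 | 4, 5, 4
66 | 4, 6 | 4, 6, 4
67 | 4, 7 | 4, 7, 4
68 | 4, 8 | 4, 8, 4
69 | 4, 9 | 4, 9, 4
70 | 4, 10 | 4, 10, 4
71 | 4, 11 | 4, 1, 4
72 | 4, 12 | 4, 2, 4
73 | 4, 13 | 4, 3, 4
74 | 4, 14 | 4, 4, 4
75 | 4, 15 | 4, 5, 4
76 | 4, 16 | 4, 6, 4
77 | 4, 17 | 4, 7, 4
78 | 4, 18 | 4, 8, 4
79 | 4, 19 | 4, 9, 4
80 | 4, 20 | 4, 10, 4
81 | 5, 1 | 5, 1, 5
82 | 5, 2 | 5, 2, 5
83 | 5, 3 | 5, 3, 5
84 | 5, 4 | 5, 4, 5
85 | 5, 5 | 5, 5, 5
86 | 5, 6 | 5, 6, 5
87 | 5, 7 | 5, 7, 5
88 | 5, 8 | 5, 8, 5
89 | 5, 9 | 5, 9, 5
90 | 5, 10 | 5, 10, 5
91 | 5, 11 | 5, 1, 5
92 | 5, 12 | 5, 2, 5
93 | 5, 13 | 5, 3, 5
94 | 5, 14 | 5, 4, 5
95 | 5, 15 | 5, 5, 5
96 | 5, 16 | 5, 6, 5
97 | 5, 17 | 5, 7, 5
98 | 5, 18 | 5, 8, 5
99 | 5, 19 | 5, 9, 5
100 | 5, 20 | 5, 10, 5
101 | 6, 1 | 6, 1, 6
102 | 6, 2 | 6, 2, 6
103 | 6, 3 | 6, 3, 6
104 | 6, 4 | 6, 4, 6
105 | 6, 5 | 6, 5, 6
106 | 6, 6 | 6, 6, 6
107 | 6, 7 | 6, 7, 6
108 | 6, 8 | 6, 8, 6
109 | 6, 9 | 6, 9, 6
110 | 6, 10 | 6, 10, 6
111 | 6, 11 | 6, 1, 6
112 | 6, 12 | 6, 2, 6
113 | 6, 13 | 6, 3, 6
114 | 6, 14 | 6, 4, 6
115 | 6, 15 | 6, 5, 6
116 | 6, 16 | 6, 6, 6
117 | 6, 17 | 6, 7, 6
118 | 6, 18 | 6, 8, 6
119 | 6, 19 | 6, 9, 6
120 | 6, 20 | 6, 10, 6
121 | 7, 1 | 7, 1, 7
122 | 7, 2 | 7, 2, 7
123 | 7, 3 | 7, 3, 7
124 | 7, 4 | 7, 4, 7
125 | 7, 5 | 7, 5, 7
126 | 7, 6 | 7, 6, 7
127 | 7, 7 | 7, 7, 7
128 | 7, 8 | 7, 8, 7
129 | 7, 9 | 7, 9, 7
130 | 7, 10 | 7, 10, 7
131 | 7, 11 | 7, 1, 7
132 | 7, 12 | 7, 2, 7
133 | 7, 13 | 7, 3, 7
134 | 7, 14 | 7, 4, 7
135 | 7, 15 | 7, 5, 7
136 | 7, 16 | 7, 6, 7
137 | 7, 17 | 7, 7, 7
138 | 7, 18 | 7, 8, 7
139 | 7, 19 | 7, 9, 7
140 | 7, 20 | 7, 10, 7
141 | 8, 1 | 8, 1, 8
142 | 8, 2 | 8, 2, 8
143 | 8, 3 | 8, 3, 8
144 | 8, 4 | 8, 4, 8
145 | 8, 5 | 8, 5, 8
146 | 8, 6 | 8, 6, 8
147 | 8, 7 | 8, 7, 8
148 | 8, 8 | 8, 8, 8
149 | 8, 9 | 8, 9, 8
150 | 8, 10 | 8, 10, 8
151 | 8, 11 | 8, 1, 8
152 | 8, 12 | 8, 2, 8
153 | 8, 13 | 8, 3, 8
154 | 8, 14 | 8, 4, 8
155 | 8, 15 | 8, 5, 8
156 | 8, 16 | 8, 6, 8
157 | 8, 17 | 8, 7, 8
158 | 8, 18 | 8, 8, 8
159 | 8, 19 | 8, 9, 8
160 | 8, 20 | 8, 10, 8
161 | 9, 1 | 9, 1, 9
162 | 9, 2 | 9, 2, 9
163 | 9, 3 | 9, 3, 9
164 | 9, 4 | 9, 4, 9
165 | 9, 5 | 9, 5, 9
166 | 9, 6 | 9, 6, 9
167 | 9, 7 | 9, 7, 9
168 | 9, 8 | 9, 8, 9
169 | 9, 9 | 9, 9, 9
170 | 9, 10 | 9, 10, 9
171 | 9, 11 | 9, 1, 9
172 | 9, 12 | 9, 2, 9
173 | 9, 13 | 9, 3, 9
174 | 9, 14 | 9, 4, 9
175 | 9, 15 | 9, 5, 9
176 | 9, 16 | 9, 6, 9
177 | 9, 17 | 9, 7, 9
178 | 9, 18 | 9, 8, 9
179 | 9, 19 | 9, 9, 9
180 | 9, 20 | 9, 10, 9
181 | 10, 1 | 10, 1, 10
182 | 10, 2 | 10, 2, 10
183 | 10, 3 | 10, 3, 10
184 | 10, 4 | 10, 4, 10
185 | 10, 5 | 10, 5, 10
186 | 10, 6 | 10, 6, 10
187 | 10, 7 | 10, 7, 10
188 | 10, 8 | 10, 8, 10
189 | 10, 9 | 10, 9, 10
190 | 10, 10 | 10, 10, 10
191 | 10, 11 | 10, 1, 10
192 | 10, 12 | 10, 2, 10
193 | 10, 13 | 10, 3, 10
194 | 10, 14 | 10, 4, 10
195 | 10, 15 | 10, 5, 10
196 | 10, 16 | 10, 6, 10
197 | 10, 17 | 10, 7, 10
198 | 10, 18 | 10, 8, 10
199 | 10, 19 | 10, 9, 10
200 | 10, 20 | 10, 10, 10
201 | 11, 1 |
202 | 11, 2 |
203 | 11, 3 |
204 | 11, 4 |
205 | 11, 5 |
206 | 11, 6 |
207 | 11, 7 |
208 | 11, 8 |
209 | 11, 9 |
210 | 11, 10 |
211 | 11, 11 |
212 | 11, 12 |
213 | 11, 13 |
214 | 11, 14 |
215 | 11, 15 |
216 | 11, 16 |
217 | 11, 17 |
218 | 11, 18 |
219 | 11, 19 |
220 | 11, 20 |
221 | 12, 1 |
222 | 12, 2 |
223 | 12, 3 |
224 | 12, 4 |
225 | 12, 5 |
226 | 12, 6 |
227 | 12, 7 |
228 | 12, 8 |
229 | 12, 9 |
230 | 12, 10 |
231 | 12, 11 |
232 | 12, 12 |
233 | 12, 13 |
234 | 12, 14 |
235 | 12, 15 |
236 | 12, 16 |
237 | 12, 17 |
238 | 12, 18 |
239 | 12, 19 |
240 | 12, 20 |
241 | 13, 1 |
242 | 13, 2 |
243 | 13, 3 |
244 | 13, 4 |
245 | 13, 5 |
246 | 13, 6 |
247 | 13, 7 |
248 | 13, 8 |
249 | 13, 9 |
250 | 13, 10 |
251 | 13, 11 |
252 | 13, 12 |
253 | 13, 13 |
254 | 13, 14 |
255 | 13, 15 |
256 | 13, 16 |
257 | 13, 17 |
258 | 13, 18 |
259 | 13, 19 |
260 | 13, 20 |
261 | 14, 1 |
262 | 14, 2 |
263 | 14, 3 |
264 | 14, 4 |
265 | 14, 5 |
266 | 14, 6 |
267 | 14, 7 |
268 | 14, 8 |
269 | 14, 9 |
270 | 14, 10 |
271 | 14, 11 |
272 | 14, 12 |
273 | 14, 13 |
274 | 14, 14 |
275 | 14, 15 |
276 | 14, 16 |
277 | 14, 17 |
278 | 14, 18 |
279 | 14, 19 |
280 | 14, 20 |
281 | 15, 1 |
282 | 15, 2 |
283 | 15, 3 |
284 | 15, 4 |
285 | 15, 5 |
286 | 15, 6 |
287 | 15, 7 |
288 | 15, 8 |
289 | 15, 9 |
290 | 15, 10 |
291 | 15, 11 |
292 | 15, 12 |
293 | 15, 13 |
294 | 15, 14 |
295 | 15, 15 |
296 | 15, 16 |
297 | 15, 17 |
298 | 15, 18 |
299 | 15, 19 |
300 | 15, 20 |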
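The configurations listed in Table 2 appear to follow a regular pattern, which might be enumerated as sketched below. This is an illustrative reading of the table and not a statement of how the configurations were originally generated: the function name table2_row is hypothetical, and the pattern (one hidden layer of i nodes; two hidden layers stepping the second layer from 1 to 20; three hidden layers of the form a, b, a listed for the first 200 rows only) is inferred from the listed values.

```python
def table2_row(i):
    """Return the configurations on 1-based row i of Table 2.

    Each configuration is a tuple of node counts, one entry per hidden layer.
    The three-hidden-layer entry is None for rows beyond 200.
    """
    one_layer = (i,)
    a = (i - 1) // 20 + 1                  # first-layer size changes every 20 rows
    two_layer = (a, (i - 1) % 20 + 1)      # second layer cycles through 1..20
    three_layer = (a, (i - 1) % 10 + 1, a) if i <= 200 else None
    return one_layer, two_layer, three_layer

rows = [table2_row(i) for i in range(1, 301)]
```

For example, row 143 yields one hidden layer of 143 nodes, two hidden layers of 8 and 3 nodes, and three hidden layers of 8, 3 and 8 nodes.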

The initial weights between each connection and the biases of the ANN were set at the beginning (randomized), and during the training the weights were adjusted by the learning function: gradient descent with momentum weight/bias learning. The criterion used to stop the training phase in each ANN was when the root-mean-square error was less than 0.09 or when the correct classification rate was equal to or greater than 80%. The values of the biomarkers and clinical data were directly involved in the modification of the connection weights in the ANN model during training. To avoid over-fitting, a 10-fold cross-validation was used.
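By way of illustration only, a momentum-based weight update and the stopping criterion described above might be sketched as follows. This is a minimal sketch under stated assumptions: the classical momentum form shown here may differ from the update rule of the software used in this example, and the names momentum_step and should_stop, the learning rate and the momentum constant are hypothetical.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One gradient-descent-with-momentum update (classical form, assumption)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def should_stop(errors, n_correct, n_total, rmse_limit=0.09, rate_limit=0.80):
    """Stop when RMSE < 0.09 or the correct classification rate is >= 80%."""
    rmse = float(np.sqrt(np.mean(np.square(errors))))
    rate = n_correct / n_total
    return rmse < rmse_limit or rate >= rate_limit

# Hypothetical single-weight example
w, v = momentum_step(np.array([1.0]), np.array([2.0]), np.array([0.0]), lr=0.1)
```

In a training loop, should_stop would be evaluated after each pass over the training data, using the output errors and the count of correctly classified subjects from that pass.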

The performance of the ANNs in the validation phase was evaluated using 600 subjects: 459 with a diagnosis of lung cancer and 141 without lung cancer. The ROC curve was plotted and the AUC, sensitivity and specificity were calculated. The sensitivity of the ANN with the best combination of biomarkers was compared with that of an “any biomarker high” rule. In addition, multivariate logistic regression (MLR) using only the biomarkers and another MLR employing a combination of biomarkers and clinical parameters were plotted. The comparison was carried out on the basis of Receiver Operating Characteristic (ROC) curves at a clinically relevant specificity (80%).
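By way of illustration only, reading a sensitivity off a ROC curve at a fixed specificity might be sketched as follows. This is a minimal sketch under stated assumptions, not the analysis performed in this example: the function name sensitivity_at_specificity and the scores are hypothetical, and a subject is assumed to be classified positive when its score is at or above the threshold.

```python
import numpy as np

def sensitivity_at_specificity(scores_cancer, scores_control, target_spec=0.80):
    """Find the lowest threshold whose specificity is at least target_spec.

    Returns (threshold, sensitivity, specificity), or None if no candidate
    threshold reaches the target specificity.
    """
    scores_cancer = np.asarray(scores_cancer, dtype=float)
    scores_control = np.asarray(scores_control, dtype=float)
    candidates = np.unique(np.concatenate([scores_cancer, scores_control]))
    for thr in candidates:                      # np.unique returns ascending order
        spec = np.mean(scores_control < thr)    # controls classified negative
        if spec >= target_spec:
            sens = np.mean(scores_cancer >= thr)  # cancers classified positive
            return float(thr), float(sens), float(spec)
    return None

# Hypothetical classifier scores, for illustration only
controls = [0.1, 0.2, 0.3, 0.4, 0.9]
cancers = [0.35, 0.6, 0.7, 0.8, 0.95]
result = sensitivity_at_specificity(cancers, controls)
```

Because specificity does not decrease as the threshold rises, the first threshold that reaches the target also gives the highest sensitivity available at that specificity.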

Thirteen ANNs from 800 ANNs with different architectures showed the best results. From these thirteen ANNs, the ANNs named net, net4, net5, net6, net9 and net11, with different hidden layers and numbers of neurons/nodes, were tested using the entire dataset of 600 subjects: 459 with a diagnosis of lung cancer and 141 without lung cancer. A summary of the best performing neural nets is provided in the following Table 3:

TABLE 3

Name of ANN   Architecture of the best ANNs   AUC obtained for the best ANNs
net           5                               0.9204
net2          5                               0.9297
net3          [5, 10, 5]                      0.9211
net4          5, 20                           0.9973
net5          6, 20                           0.966
net6          10, 20                          0.979
net7          1                               0.8559
net8          1                               0.8527
net9          1, 5, 1                         0.8535
net10         6                               0.9014
net11         6                               0.9295
net12         3, 2                            0.853
net13         2, 3                            0.858

The ANN with the best classification performance in the test phase was net4, which had a configuration of two hidden layers with 5 and 20 neurons, respectively. This ANN correctly classified 89.3% of the subjects (536 of 600). The area under the curve value was 0.91. The sensitivity at a specificity of 80% in the ROC curve for the ANN was 90.2% (see, e.g., FIG. 15D). See Table 1 for the clinical factors used and the six biomarkers tested.

When the specificity was increased to 92.0%, the sensitivity was not substantially compromised, at a value of 88.0% (data not shown). The following Table 4 shows the number of correct and incorrect classifications, AUC, sensitivity and specificity of the ANNs chosen for a data set (600 subjects).

TABLE 4

                   net     net4    net5    net6    net9    net11
AUC                0.72    0.91    0.82    0.85    0.74    0.76
Test negative
  Incorrect        69      11      32      26      51      44
  Correct          72      130     109     115     90      97
  Total controls   141     141     141     141     141     141
Test positive
  Correct          424     406     395     403     391     403
  Incorrect        35      53      64      56      68      56
  Total cancer     459     459     459     459     459     459
Sensitivity        0.92    0.88    0.86    0.88    0.85    0.88
Specificity        0.51    0.92    0.77    0.81    0.64    0.68
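The sensitivity, specificity, and correct classification rate follow directly from the counts in Table 4. The sketch below reads the "Total controls" rows as true negatives (correct) and false positives (incorrect), and the "Total cancer" rows as true positives (correct) and false negatives (incorrect); that reading is an interpretation of the table, and the variable names are illustrative.

```python
# Counts from Table 4: (true negatives, false positives, true positives, false negatives)
TABLE_4 = {
    "net":   (72, 69, 424, 35),
    "net4":  (130, 11, 406, 53),
    "net5":  (109, 32, 395, 64),
    "net6":  (115, 26, 403, 56),
    "net9":  (90, 51, 391, 68),
    "net11": (97, 44, 403, 56),
}

def metrics(tn, fp, tp, fn):
    """Sensitivity, specificity and overall accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

net4 = metrics(*TABLE_4["net4"])
# net4: 406/459 sensitivity, 130/141 specificity, 536/600 correctly classified
```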

This model was compared to other models; see, e.g., FIGS. 15A-15C and Example 4.

Example 4 Comparison of Statistical Models to ANN

According to embodiments of the present invention, a variety of statistical and machine learning approaches were utilized to classify individuals as having lung cancer or not having lung cancer (FIGS. 15A-15D). At 80% specificity, sensitivities were determined for assessing a patient's likelihood of having cancer based on any single biomarker being elevated, a method referred to as Any Biomarker High. Sensitivity was relatively low using this method, e.g., the sensitivity was found to be 51% (FIG. 15A). In this model, when any of the measured biomarkers, such as those of Table 6, is at or above a literature-recognized threshold indicative of the presence of cancer, the result is deemed “Any Biomarker High” and the patient is categorized as having cancer. For example, the cut-off values for the biomarkers of Table 1 are: CEA ≥ 5 µg/L; CYFRA 21-1 ≥ 3.3 µg/L; NSE ≥ 25 µg/L; SCC ≥ 2 µg/L; CA 19-9 ≥ 37 U/mL; and Pro-GRP ≥ 50 ng/L.
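The Any Biomarker High rule amounts to a simple threshold test, sketched below with the cut-off values listed above. The dictionary keys, the function name, and the choice to skip biomarkers that were not measured are illustrative assumptions.

```python
# Cut-offs from the text; units are ug/L except CA19-9 (U/mL) and Pro-GRP (ng/L).
CUTOFFS = {
    "CEA": 5.0,
    "CYFRA 21-1": 3.3,
    "NSE": 25.0,
    "SCC": 2.0,
    "CA19-9": 37.0,
    "Pro-GRP": 50.0,
}

def any_biomarker_high(measurements, cutoffs=CUTOFFS):
    """True when at least one measured biomarker is at or above its cut-off."""
    return any(
        measurements[name] >= cutoff
        for name, cutoff in cutoffs.items()
        if name in measurements  # unmeasured biomarkers are skipped (assumption)
    )
```

For example, a patient with CEA of 2 and CYFRA 21-1 of 4 would be categorized as having cancer under this rule because CYFRA 21-1 exceeds 3.3.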

In another embodiment, at 80% specificity, sensitivities were determined for assessing a patient's likelihood of having cancer based on six biomarkers and using multivariate logistic regression. See FIG. 15B. In this method, a line was used to divide a population of data points into two categories, based on the following equation: y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ. Typically, sensitivity was relatively low using this method, e.g., for the six given biomarkers, the sensitivity was found to be 70.4%. In this case, the biomarkers were CA 19-9, CEA, Cyfra 21-1, NSE, Pro-GRP, and SCC.

In another embodiment, at 80% specificity, sensitivities were determined for assessing a patient's likelihood of having cancer based on six biomarkers combined with clinical factors and using multivariate logistic regression. See FIG. 15C and Table 1 for the list of clinical factors and biomarkers. In this method, a line was used to divide a population of data points into two categories, based on the following equation: y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ. Similar to the six biomarkers model, the sensitivity was relatively low using this method (but better than only measuring the panel of six biomarkers), e.g., for the six given biomarkers and clinical factors, the sensitivity was found to be 75.6%. In this case, the biomarkers were CA 19-9, CEA, Cyfra 21-1, NSE, Pro-GRP, and SCC, and the clinical factors were smoking status, pack years, patient age, family history of lung cancer, and recoded has symptoms.
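As a sketch of how the linear equation is used in logistic regression, the value y can be passed through the logistic function to give a probability, which is then compared with a decision threshold. The coefficient values in the usage note and the function names are hypothetical; the fitted coefficients of the study are not given in the text.

```python
import math

def mlr_probability(intercept, coefficients, inputs):
    """Probability from the linear predictor y = b0 + b1*x1 + ... + bn*xn."""
    y = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    return 1.0 / (1.0 + math.exp(-y))

def classify(probability, threshold=0.5):
    """Assign the 'cancer' category when the probability reaches the threshold.

    The threshold would in practice be chosen to give the desired
    specificity (e.g. 80%); 0.5 is only a placeholder.
    """
    return "cancer" if probability >= threshold else "no cancer"
```

With hypothetical coefficients, `mlr_probability(0.0, [1.0, -1.0], [2.0, 2.0])` gives 0.5, the point where the linear predictor equals zero and the dividing line is crossed.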

In yet another embodiment, at 80% specificity, sensitivities were determined for assessing a patient's likelihood of having cancer based on six biomarkers combined with clinical factors and using an artificial neural network. See FIG. 15D. The neural net categorized patients as likely to have cancer or not likely to have cancer.

In this example (see also FIG. 15D and Example 5), a feedforward pattern recognition neural network was used to classify patients as likely or not likely to have lung cancer. Inputs to the ANN included the biomarkers CA 19-9, CEA, Cyfra 21-1, NSE, Pro-GRP, and SCC, and the clinical parameters smoking status, pack years, patient age, family history of lung cancer, and, when available, recoded has symptoms. Outputs of the neural network were provided as (1) likely to have lung cancer and (2) not likely to have lung cancer. A marked improvement in sensitivity was achieved by this method, with the sensitivity being greater than 90%, and thus better than any of the other methods of FIGS. 15A-15C. In embodiments, a patient deemed likely to have lung cancer is then recommended for diagnostic testing such as CT testing.

In summary, the ANN at a specificity of 80% increased the sensitivity by 39.2 percentage points as compared to Any Biomarker High (FIG. 15A), which had 51% sensitivity. The ANN at a specificity of 80% increased the sensitivity by 19.8 percentage points as compared to MLR combining only 6 biomarkers, which had 70.4% sensitivity. The ANN at a specificity of 80% increased the sensitivity by 14.6 percentage points as compared to MLR combining 6 biomarkers and clinical factors, which showed 75.6% sensitivity. The area under the curve values for ANN, Any Biomarker High, MLR (only biomarkers), and MLR (biomarkers plus clinical factors) were 0.91, 0.76, 0.84, and 0.87, respectively.
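The gains quoted in this summary are absolute differences between the ANN sensitivity of 90.2% and each comparison method, all at 80% specificity. The short check below reproduces them; the variable names are illustrative.

```python
ANN_SENSITIVITY = 90.2  # percent, at 80% specificity

# Sensitivities (percent) of the comparison methods at 80% specificity.
BASELINES = {
    "Any Biomarker High": 51.0,
    "MLR, six biomarkers": 70.4,
    "MLR, six biomarkers plus clinical factors": 75.6,
}

# Absolute gain of the ANN over each method, rounded to one decimal place.
gains = {name: round(ANN_SENSITIVITY - value, 1) for name, value in BASELINES.items()}
```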

Example 5 Identification of Novel Predictors of Disease

In some embodiments, the neural net can be used to identify novel predictors of a disease. For example, a novel biomarker or clinical factor or other type of input as disclosed in this application (e.g., from the literature, from the environment, etc.) may be selected as an input into a neural net, and it can be determined whether the novel input is predictive of lung cancer. In such cases, the novel input may have no known previous association with lung cancer.

For example, the novel input is selected as an input into a neural net system. The neural net is trained according to Example 3 and FIG. 14. It is determined whether the sensitivity increases, remains about the same, or decreases as compared to the neural net without the novel input. If the sensitivity increases, then the novel input may act as a predictor of the disease.
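The comparison step can be sketched as a small decision function. The tolerance that defines "about the same" is a hypothetical value chosen for illustration; the text does not state one, and both sensitivities are assumed to be measured at the same specificity.

```python
def effect_of_novel_input(baseline_sensitivity, new_sensitivity, tolerance=0.01):
    """Compare sensitivity with and without the novel input.

    tolerance is an assumed margin (here one percentage point) within which
    the two sensitivities are treated as about the same.
    """
    delta = new_sensitivity - baseline_sensitivity
    if delta > tolerance:
        return "increases"      # the novel input may act as a predictor
    if delta < -tolerance:
        return "decreases"
    return "about the same"
```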

As a specific example, a secondary disease is selected as an input into a neural net system. See, for example, Lung Cancer and peripheral vascular surgery (1983) Beachamp G. et al. Can J. Surg 26(5):472-4. The neural net is trained according to Example 3 and FIG. 14. It is determined whether the sensitivity increases, remains about the same, or decreases as compared to the neural net without the secondary disease. If the sensitivity increases, then the secondary disease may act as a predictor of the disease. Additionally, these techniques may be used to identify novel relationships between a disease such as lung cancer and diseases which correlate or co-exist with lung cancer.

For example, the techniques presented herein may be used to determine that lung cancer and peripheral vascular disease are frequently correlated, e.g., a patient that has lung cancer is also likely to have peripheral vascular disease.

Example 6 Ranking Input Factors

In some embodiments, the neural net can be used to rank input factors for a disease, to identify which inputs are the most predictive of a disease. For example, any number of novel biomarkers, clinical factors, or other types of inputs as disclosed in this application (e.g., from the literature, from the environment, etc.) may be selected as input into a neural net (e.g., tens, hundreds, thousands), and the neural net can be used to determine which subset of inputs is the most predictive of lung cancer. In some cases, the most predictive inputs may have no known previous association or relationship to the disease, e.g., lung cancer.

TABLE B Ranking of biomarkers and clinical factors for predicting lung cancer.

                  importance index
cyfra             18.27887711
cea               17.70786983
smoke_duration    16.50408067
nodule            15.37358411
nse               10.77973493
grp               8.153103186
age               7.903903228
scc               6.183531192
ca                5.689815943
smoke_status      3.943065227
cough             3.199732643
symptom           2.623152619
history           1.613154169
pack              0.362940835
Cigarette daily   0.226657039
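Given importance indices such as those of Table B, selecting the most predictive subset of inputs is a sort followed by a cut. The sketch below uses the Table B values; the function names and the idea of expressing each index as a share of the total are illustrative assumptions, not part of the described method.

```python
# Importance indices from Table B.
IMPORTANCE = {
    "cyfra": 18.27887711, "cea": 17.70786983, "smoke_duration": 16.50408067,
    "nodule": 15.37358411, "nse": 10.77973493, "grp": 8.153103186,
    "age": 7.903903228, "scc": 6.183531192, "ca": 5.689815943,
    "smoke_status": 3.943065227, "cough": 3.199732643, "symptom": 2.623152619,
    "history": 1.613154169, "pack": 0.362940835, "Cigarette daily": 0.226657039,
}

def top_inputs(importance, k):
    """The k inputs with the largest importance index, most important first."""
    return sorted(importance, key=importance.get, reverse=True)[:k]

def importance_shares(importance):
    """Each index as a fraction of the total, for easier comparison."""
    total = sum(importance.values())
    return {name: value / total for name, value in importance.items()}
```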

Example 7 Clinical Factors Plus Serum Biomarkers is Superior to Either Method Alone

Traditionally, physician judgment has formed the basis for lung cancer risk estimation, patient counseling, and decision making. However, clinicians' estimates are often biased due to both subjective and objective confounders. To mitigate this problem and to obtain more accurate lung cancer predictions, dozens of multi-biomarker panels have been developed over the last decade to better estimate the presence of lung cancer.

According to the embodiments presented herein, it has been found that cognitive computing/machine learning models further improve discrimination accuracy in lung cancer risk estimation.

Statistical models can provide assistance in processing a large number of variables (biomarker values and clinical factors). Several different statistical methods have been applied to discriminate between patients with and without lung cancer, such as multivariate logistic regression (MLR), random forest (RF), classification and regression trees, support vector machine (SVM), etc. These methods have been used to develop algorithms that combine measurements of the most predictive biomarkers in a panel to achieve the highest diagnostic accuracy. It is a goal of present invention embodiments to use machine learning technology (e.g., neural networks or deep learning neural networks) to develop a biomarker panel in combination with clinical factors, which may include additional inputs as well, that is cost-effective and can be deployed world-wide, even in areas of the world with medical systems constrained by costs. Therefore, the development of a cost-effective platform would be beneficial.

The goals of the current study were to confirm the accuracy of a panel of biomarkers on an independent data set; to explore the accuracy relative to and in combination with clinical risk predictors, with a focus on at-risk patients relevant to lung cancer screening; and to further investigate whether an advanced multi-parameter statistical algorithm can materially improve the diagnostic accuracy of our lung cancer test.

Methods

Training Set Serum Samples.

All of the cancer and normal control samples used in the training set were IRB-approved, consented serum samples that were purchased from the Clinical Research Center of Cape Cod, Inc. (Cape Cod, Mass.), Asterand (Detroit, Mich.), Indivumed (Germany) or Bioreclamation IVT (New York, N.Y.). All of the lung cancer samples were collected at physicians' offices or hospitals.

All lung cancer and control serum samples were from patients 50 years of age or older who were current or former smokers with a smoking history of greater than 20 pack years and less than 15 years of smoking cessation. Diagnosis of the lung cancer cohort was confirmed from surgical pathology reports. The control group had no evidence of current or prior cancer.

Testing Set Serum Samples.

All of the cancer and normal control samples used in the testing set were obtained from an IRB-approved blood biorepository at the Cleveland Clinic. All patients had provided written informed consent. All lung cancer cases were biopsy confirmed and untreated. Control patient samples were obtained from patients attending the lung cancer screening clinic or general Pulmonary clinic.

Sample analysis.

Multiplex magnetic bead-based immunoassay of CEA, CYFRA21-1, CA125 and HGF in patient sera was performed using reagents from EMD Millipore, Inc. as previously described (Mantovani et al., Chemo-radiotherapy in lung cancer: state of the art with focus on the elderly population. Ann Oncol. 2006; 17 (Suppl 2):ii 61-63.6.). The MILLIPLEX® MAP Human Circulating Cancer Biomarker Magnetic Bead Panel 1 was used. Four tumor proteins (CEA, CYFRA21-1, CA125 and HGF) were measured using the MAGPIX® instrument (Luminex Corporation, Austin, Tex.) as previously described [Moyer V A; US Preventive Services Task Force. Screening for lung cancer: US Preventive Services Task Force recommendation statement. Ann Intern Med. 2014; 160(5):330-338]. Using Median Fluorescent Intensity (MFI) values and a five-parameter logistic curve fitting method (xPONENT® software for the MAGPIX®), the concentrations of each tumor protein in the samples were calculated. The calculated protein concentration values were used for the subsequent analysis.

NY-ESO1 autoantibody detection was performed using an immunoassay developed at 20/20 Gene Systems, MD and the MAGPIX® reader as previously described [Id.]. Background-subtracted MFI values were used for the subsequent analysis.
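For illustration, a five-parameter logistic (5PL) standard curve and its inverse can be written as below. This is the commonly used 5PL form, assumed here rather than taken from the xPONENT® software, whose exact parameterization is not stated in the text; the parameter names and the numeric values in the usage note are hypothetical.

```python
def five_pl(x, a, b, c, d, g):
    """MFI predicted by a 5PL curve for concentration x.

    a: response at zero concentration, d: response at infinite concentration,
    c: mid-range concentration parameter, b: slope, g: asymmetry factor.
    """
    return d + (a - d) / (1.0 + (x / c) ** b) ** g

def inverse_five_pl(y, a, b, c, d, g):
    """Concentration that produces MFI y, valid for y strictly between a and d."""
    return c * (((a - d) / (y - d)) ** (1.0 / g) - 1.0) ** (1.0 / b)
```

Once the five parameters have been fitted to the standards, `inverse_five_pl` converts a sample's MFI into a concentration, e.g. `inverse_five_pl(five_pl(37.5, 50, 1.2, 100, 20000, 0.8), 50, 1.2, 100, 20000, 0.8)` recovers 37.5 up to rounding error.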

Statistical Analysis.

The study cohort was divided into two groups based upon the outcome of cancer or control. The demographics, comorbidities, and cancer characteristics were described using sample mean with standard deviation or proportion as appropriate.

Multivariate logistic regression analysis: To determine the direction and statistical significance of the effect of each biomarker on the outcome, we performed multivariate logistic regression (MLR) analysis for the full data set. Each MLR model included the five biomarkers. The AUC was calculated for the ROC curves that were constructed based on the models. Exploratory MLR analyses were performed on the testing set, divided by stage and histology, and after including clinical variables. Clinical variables included age, sex, a clinical diagnosis of COPD, and smoking history.

Random forests analysis: Random Forest (RF) models were used to identify the variables that were associated with and predictive of cancer (Bach P B, Mirkin J N, Oliver T K, Azzoli C G, Berry D A, Brawley O W, et al. Benefits and harms of CT screening for lung cancer: a systematic review. JAMA. 2012; 307(22):2418-29. doi:10.1001/jama.2012.552). To avoid the possible overfitting of the MLR models, we used the repeated random-split cross-validation procedure (Croswell J M, Kramer B S, Kreimer A R, Prorok P C, Xu J L, Baker S G, et al. Cumulative incidence of false-positive results in repeated, multimodal cancer screening. Ann Fam Med. 2009; 7:212-22). Specifically, we randomly split the data into training (70%) and validation (30%) sets 100 times. The RF model was built on each training set and then evaluated on the corresponding test set. The validation results were reported as the average performance over all test sets. Exploratory RF analyses were performed on the testing set, divided by stage and histology, and after including clinical variables (as above).
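The repeated random-split procedure can be sketched independently of the model being fitted. In the sketch below, `fit_and_score` is a hypothetical callable that builds a model on the training indices and returns its performance (for example, AUC) on the held-out indices; the seed and function names are likewise assumptions.

```python
import random

def repeated_random_split(n_samples, n_repeats=100, train_fraction=0.70, seed=0):
    """Yield (train_indices, test_indices) for each random 70:30 split."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]

def average_performance(n_samples, fit_and_score, **split_options):
    """Average the held-out score over all random splits."""
    scores = [
        fit_and_score(train, test)
        for train, test in repeated_random_split(n_samples, **split_options)
    ]
    return sum(scores) / len(scores)
```

For the 400-sample set described here, each split would hold 280 samples for training and 120 for evaluation.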

Results

The training set consisted of 604 patient samples (268 with lung cancer, 336 controls). 151 of those with lung cancer (56.3%) had adenocarcinoma and 144 of the 268 lung cancers (53.7%) were stage I. The testing set consisted of 400 patient samples (155 with lung cancer, 245 controls). 74 (47.7%) of those with lung cancer had adenocarcinoma and 52 of the 155 lung cancers (33.5%) were stage I (Table 5).

TABLE 5 Clinical characteristics of the cancer and control patients in the training and testing sets

                     Training (604)                  Validation (400)
                     Cancer (268)   Control (336)    Cancer (155)   Control (245)
Age                  64.0           64.5             65.3           68.3
Sex (% F)            43.7           39.9             40             51.9
Smoking (C/F/N)      N/A            N/A              20/129/6       95/142/7
Pack years           >20            >20              43             35
Adenocarcinoma (%)   56.3                            47.7
Squamous (%)         33.2                            39.4
Stage I (%)          53.7                            33.5
Stage II (%)         24.3                            12.3
Stage III (%)        17.9                            37.4
Stage IV (%)         4.1                             16.8

Training set results showed that the combination of the biomarkers studied was more accurate than the individual biomarkers considered alone (panel AUC 0.80 vs. individual AUC 0.45-0.71). A logistic regression model was built on the training set using the biomarker values and then applied to the validation set. The diagnostic accuracy of the 4 biomarker panel in the validation set was comparable with that of the training set (AUC 0.81).

Less meta-data was available for the training samples for development of an algorithm that combines clinical factors and biomarker values. Therefore, to evaluate an algorithmic approach that combines biomarker and clinical data, further analyses were performed only on the validation set samples (n=400).

TABLE 6 Logistic Regression (LR) and Random Forest (RF) model performance using biomarker panels and clinical factors

                   LR model                                      Random Forest model (70:30 Split)
Variable           AUC*   Sensitivity (%)   Specificity (%)      AUC*
Clinical factors   0.68   34                85                   0.66
Biomarkers         0.81   66                86                   0.84
Combined           0.86   80                80                   0.87

In exploratory analysis, a Multivariate Logistic Regression (MLR) model built from clinical variables in the validation set (age, sex, COPD, smoking history) had an AUC of 0.68. When combined with the 4 biomarker panel the AUC was 0.86 (Table 6). Similarly, Random Forest (RF) modeling of the clinical factors alone and of the biomarker values alone yielded average AUCs of 0.66 and 0.84, respectively. When the clinical factors were combined with the 4 biomarker panel, the AUC improved to 0.87 (Table 6).

The validation sample set from the Cleveland Clinic (n=400) has a significant number of samples that did not conform to the indication criteria of either the USPTF or PAULAs Test. “PAULAs test” (an acronym for Protein Analytes Used for Lung cancer Algorithms) measures the levels of serum antigens and an autoantibody, and takes into account several clinical factors including patient age, smoking history, and prior lung disease. The test is intended to be used as an initial screen for non-small cell lung cancer (NSCLC) in asymptomatic individuals from a high-risk population (e.g., 20 pack-year current smokers or past smokers who quit less than 15 years ago, and are over the age of 50) who are not receiving annual CT scans. Specifically, samples are included with variations in smoking history, including some never smokers. Some patients have smoking histories with less than 20 pack years (and <30 pack years as per USPTF). Some patients are under age 50 (and under age 55 or over age 80 as per USPTF).

Using random forest statistical analysis, we evaluated the improvement yielded by each single predictor and identified the panel of classifiers that seem most significant among both the biomarkers and the clinical factors: CEA, CA-125, CYFRA and NYESO-1, age, smoking history, pack years and COPD. The performance of this panel in the population conforming to PAULAs test inclusion criteria (e.g., 20 pack-year current smokers or past smokers who quit less than 15 years ago, and are over the age of 50) was better than in a broader population that included smokers under age 50 and with less than 20 pack years (Table 7). At approximately the same specificity (79% vs 80%) the sensitivity falls from 81% to 74% in a broader population. It should be noted, however, that sample size (400 vs 216) may also affect the difference between the results.

TABLE 7 Performance of the test in the population within PAULAs test inclusion criteria and in a broader population.

Cohort                                           Cohort size   AUC*    Sensitivity %   Specificity %
All patients                                     n = 400       0.845   74              79
Patients within PAULAs test inclusion criteria   n = 216       0.887   80              80

FIGS. 16A-16B show the distribution of test scores in a patient cohort conforming to PAULAs test inclusion criteria. For this analysis, we excluded never smokers and those with missing information, and limited the patient cohort to the PAULAs test inclusion criteria. These figures show the distribution of PAULAs test scores using the RF model (CEA, CA-125, CYFRA and NYESO-1, age, smoking history, pack years and COPD): FIG. 16A, box-and-whiskers plot; FIG. 16B, scatter dot plot. The horizontal line in FIG. 16B shows the PAULAs test cut-off of 0.43 derived from the validation set results.

TABLE 8 Performance of the combined biomarker-clinical factors panel by lung cancer stage.

        Sensitivity (%) at 80% Specificity
Stage   All patients   Patients within PAULAs test inclusion criteria
I       69.6%          82.4%
II      70.6%          84.6%
III     63.0%          75.0%
IV      82.4%          90.0%

Using the test cutoff that corresponds to 80% fixed specificity (0.43), we evaluated the accuracy of the combined panel by stage in both groups of patients. The detection sensitivity of the early stages (I and II) in patients corresponding to PAULAs test inclusion criteria was higher than in a broader population: 83.5% vs 70.1% (Table 8).

We also explored deep neural network (DNN) modelling approaches for the test performance evaluation using the entire validation set from Cleveland Clinic (n=400). To build a DNN model, we first identified the input variables, which included both clinical factors and biomarkers. We then applied 2 hidden layers, with 1000 nodes in the first layer and 5000 nodes in the second layer. The tanh activation function was adopted in the DNN method. With 70% of the data points as the training dataset and 30% of the data points as the testing set, the DNN model produced a higher AUC (0.89) than the random forest (0.88) and logistic regression (0.87) models (Table 9).

TABLE 9 Comparison of PAULAs test results using biomarkers and clinical variables and different modelling approaches (LR, RF and DNN)

Method                AUC*   95% CI#     Sensitivity, %   Specificity, %
Logistic Regression   0.86   0.80-0.94   75               80
Random Forest         0.88   0.81-0.95   80               80
Deep learning (DNN)   0.89   0.83-0.96   90               82

Discussion

The current study validated the clinical accuracy of a combined protein and antibody panel in a population at risk of having lung cancer and explored the impact of combining clinical and biomarker variables on test accuracy. The intended use population for this study was patients at risk of having lung cancer. The results suggested that the combination of markers is more accurate than any of the markers alone. In exploratory analysis, the highest accuracy was achieved by combining clinical features and biomarker results for patients within PAULAs test inclusion criteria (50 years of age or older who were current or former smokers with a smoking history of greater than 20 pack years and less than 15 years of smoking cessation). Based on the Random Forest statistical algorithm, the test yielded the following performance: 80% sensitivity, 80% specificity, and 0.88 AUC when both biomarker values (CEA, CYFRA, CA125 and NY-ESO1) and clinical factors (age, smoking history, pack-years and COPD status) were considered.

To pursue clinical utility testing, it should be determined whether the results of this study support further development of this biomarker panel as an early detection tool. The accuracy of the test should support the potential application. To estimate the accuracy required to justify investment in a clinical utility study, a formula has been suggested that incorporates the accepted benefit:harm balance of current standard practice [Pepe M S, Janes H, Li C I, Bossuyt P M, Feng Z, Hilden J. Early-phase studies of biomarkers: What target sensitivity and specificity values might confer clinical utility? Clin Chem 2016; 62(5):737-742]. If we use this formula to determine a test accuracy that would allow us to use the results of this test to select patients for lung cancer screening from a population with a 0.2% incidence of lung cancer, and assume that we currently accept screening a population with a 0.83% incidence of lung cancer (the incidence during the screening years of the National Lung Screening Trial [The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011; 365:395-409. doi:10.1056/NEJMoa1102873]), the ratio TPR/FPR of the test, where TPR is the true positive rate (sensitivity) and FPR is the false positive rate (1−specificity), would have to be at least 4. Based on this analysis, the accuracy of the biomarker panel in the current study (e.g., sensitivity of 80% at specificity of 80% (RF model) or sensitivity of 90% at specificity of 82% (DNN model)) met the minimal biomarker panel performance (TPR/FPR=4) to support further development of the test as a screening tool. In addition, the cost of this test would be much lower than that of most omics-based testing platforms currently available. This is also important to consider when developing a screening test.
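The TPR/FPR ratios of the two models can be checked with a few lines of arithmetic. The sketch below only computes the ratio from the reported sensitivity and specificity and sets it beside the minimum of 4 stated in the text; it does not reproduce the cited formula itself, and the names are illustrative.

```python
def tpr_fpr_ratio(sensitivity, specificity):
    """Ratio of true positive rate to false positive rate (1 - specificity)."""
    return sensitivity / (1.0 - specificity)

REQUIRED_RATIO = 4.0  # minimum TPR/FPR stated in the text

# (sensitivity, specificity) reported for each model
MODELS = {"RF": (0.80, 0.80), "DNN": (0.90, 0.82)}

ratios = {name: tpr_fpr_ratio(se, sp) for name, (se, sp) in MODELS.items()}
# RF: 0.80 / 0.20, about 4; DNN: 0.90 / 0.18, about 5
```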

We also have developed a risk categorization tool based on the results from this study. This test generates a composite score from the Random Forest model comprising 4 clinical parameters and the levels of 4 biomarkers in patient serum. This score is an indicator of the level of risk for each patient of currently having lung cancer relative to others with a comparable smoking history. Using two cutoffs (0.43 and 0.62), the test results were broken down into three separate categories with increasing risk (Table 10). Table 10 indicates the probability of lung cancer for patients in a given score range at the time of testing. Positive predictive value (PPV) is the probability that a person with a test score above the chosen cutoff truly has the disease. Unlike sensitivity and specificity, the PPV is dependent on the population being tested and is influenced by the prevalence of the disease. For the PPV calculation we used the 0.83% lung cancer prevalence from the NLST study [The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011; 365:395-409.]. Table 10 shows that the higher a patient's score on PAULAs test, the greater the likelihood that this patient has cancer.
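The two-cutoff categorization can be sketched as follows. The category labels "low", "middle" and "high" are illustrative names for the three score ranges, and treating a score exactly equal to a cutoff as belonging to the higher range follows the ranges X ≥ 0.62 and 0.43 ≤ X < 0.62.

```python
def risk_category(score, low_cutoff=0.43, high_cutoff=0.62):
    """Place a composite test score into one of three risk ranges."""
    if score >= high_cutoff:
        return "high"      # X >= 0.62
    if score >= low_cutoff:
        return "middle"    # 0.43 <= X < 0.62
    return "low"           # X < 0.43
```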

TABLE 10 Test PPV in 3 separate score categories

Score Range        Sensitivity   Specificity   PPV
X ≥ 0.62           55.1%         95.3%         8.89%
0.43 ≤ X < 0.62    62.2%         84.0%         3.16%
X < 0.43           100.0%        0.0%          0.83%

Below the cutoff of 0.43, the test will not differentiate between cancer and non-cancer. Individuals whose scores fell within this range had the same likelihood of having lung cancer as those people currently recommended for LDCT by the USPTF (US Preventive Services Task Force) (0.83%). Individuals whose scores fell within the middle range were 3.8× more likely to have lung cancer than individuals currently recommended for LDCT by the USPTF. Finally, individuals whose scores fell within the high range were 10.7× more likely to have lung cancer than individuals currently recommended for LDCT by the USPTF. Presenting the result of the test using such a categorization table will inform the physician about the degree of lung cancer risk a patient has after a positive result on the test.
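The PPV values and the fold increases over the 0.83% baseline can be reproduced from sensitivity, specificity and prevalence with the standard Bayes relation. The sketch below is a check under that standard definition; small differences from the tabulated PPVs are expected because the tabulated sensitivity and specificity are rounded.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

PREVALENCE = 0.0083  # 0.83% lung cancer prevalence from the NLST study

high = ppv(0.551, 0.953, PREVALENCE)    # about 0.089, tabulated as 8.89%
middle = ppv(0.622, 0.840, PREVALENCE)  # about 0.032, tabulated as 3.16%
low = ppv(1.000, 0.000, PREVALENCE)     # equals the prevalence itself

# Fold increase over the baseline, using the tabulated PPVs (in percent).
fold_high = 8.89 / 0.83
fold_middle = 3.16 / 0.83
```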

The strengths of the current study included a reasonably large number of samples from a cohort relevant to the potential clinical application, with samples obtained from more than one source. The sample sets included a substantial portion of cases with early stage disease, and a diverse set of relevant patient comorbidities, supporting the robustness of the method. The results were compared to, and were more accurate than, clinical prediction, and the combination of the marker results with clinical features improved the accuracy of both. Exploratory analysis was performed on only the validation set from the Cleveland Clinic.

In summary, this study validated the accuracy of a panel of proteins and an autoantibody in a population relevant to lung cancer screening, and suggested a benefit to combining clinical features with the biomarker results.

Example 8 Study of Lung Cancer Biomarker Expression and Clinical Parameter Variables

The National Lung Screening Trial (“NLST”) showed that a low-dose CT (LDCT) screening program could reduce disease-specific mortality in high-risk patients by 20% and overall mortality by 7%, which proved that early lung cancer detection saves lives (and is believed to reduce lifetime disease-specific medical costs) [The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011; 365:395-409. doi:10.1056/NEJMoa1102873]. However, the major LDCT drawbacks include a high false-positive rate and the inability to unambiguously distinguish benign nodules, which can lead to expensive invasive follow-up procedures [Bach P B, Mirkin J N, Oliver T K, Azzoli C G, Berry D A, Brawley O W, et al. Benefits and harms of CT screening for lung cancer: a systematic review. JAMA. 2012; 307(22):2418-29; Croswell J M, Kramer B S, Kreimer A R, Prorok P C, Xu J L, Baker S G, et al. Cumulative incidence of false-positive results in repeated, multimodal cancer screening. Ann Fam Med. 2009; 7:212-22; Wood D E, Eapen G A, Ettinger D S, et al. Lung cancer screening. J Natl Cancer Compr Netw 2012; 10:240-265]. False-positive LDCT results occur in a substantial proportion of screened persons; 95% of all positive results do not lead to a diagnosis of cancer. Most pulmonary experts believe that biomarker testing is required to complement radiographic screening as LDCT achieves its eventual steady-state utilization.

A cohort of 459 current and former (stopped within the last 15 years) smokers with pulmonary nodules and confirmed lung cancer (lung cancer test group), and 139 matched controls with confirmed benign lung nodules, participated in the current study. All participants were 50 years or older with a smoking history of 20 pack years or more. All subjects donated blood within 6 weeks of radiographic screening to be used for measurement of biomarkers. Radiographic screening was used to characterize the pulmonary nodules, including size and number. The associated patient information comprised the ages, genders, races, final diagnoses including stage of lung cancer and histological type, family history of lung cancer, pack years, packs per day (i.e., smoking intensity), smoking duration (years), smoking status, symptoms, cough (yes or no) and blood in sputum.

Demographic and Clinical Information

For the control group the median age was 58 years, 91% were male (9% female), 50% were asymptomatic and 9% had a family history of lung cancer. For the test group (confirmed lung cancer) the median age was 62, 91% were male (9% female), 43% were asymptomatic and 8% had a family history of lung cancer. The smoking history of the test and control groups was similar, with both groups having a median of 40 pack years. In the control group 87% were current smokers, with a median age of quitting of 53.5 years and 3 years since quitting, as compared to 89% in the test group, with a median age of quitting of 60 and 4 years since quitting. In the lung cancer group, 44% were staged as early (stages I and II) and 56% as late (stages III and IV). The lung cancer was typed as adenocarcinoma 40%, squamous 34%, small cell 19%, large cell 4% and other 3%.

The serum biomarkers were measured using commercially available reagents and immunoassay techniques from Roche Diagnostics. The measured biomarkers included CEA, CA 19-9, CYFRA 21-1, NSE, SCC, and ProGRP, and levels were reported as test values. The obtained clinical parameters included family history of lung cancer, nodule size, pack years, packs per day (or smoking intensity), patient age at time of study, smoking duration (years), smoking status, cough (binary), and blood in sputum.

TABLE 11 Benign Nodules (Control group)

Biomarker (protein or unit)   Median
CA 19-9                       9
CEA                           2
CYFRA                         2
NSE                           11
Pro-GRP                       34
SCC                           1

TABLE 12 Lung Cancer (Test group)

Biomarker (protein or unit)   Median
CA 19-9                       11
CEA                           4
CYFRA                         4
NSE                           13
Pro-GRP                       37
SCC                           1

Analysis

Each of those variables (biomarkers or clinical parameters) was analyzed in a univariate logistic regression model and together in a multivariate logistic regression model. The variable analysis is provided below as area under the curve (AUC) of receiver operating characteristic (ROC) curves.

TABLE 13 Biomarker and clinical parameter analysis

| Model | Variable(s) | AUC |
|---|---|---|
| univariate | Nodule size | 0.69 |
| univariate | Pack years | 0.50 |
| univariate | Packs per day (smoking intensity) | 0.53 |
| univariate | Patient Age at time of Study | 0.66 |
| univariate | Smoking Duration (years) | 0.57 |
| univariate | Blood | 0.51 |
| univariate | Cough (yes or no) | 0.59 |
| univariate | CA 19-9 | 0.58 |
| univariate | CEA | 0.69 |
| univariate | CYFRA | 0.75 |
| univariate | NSE | 0.68 |
| univariate | ProGRP | 0.60 |
| univariate | SCC | 0.60 |
| Multivariate | CEA, CYFRA, NSE, ProGRP, nodule size, patient age, smoking duration (years) and cough (yes or no) | 0.87 |

The biomarkers were further analyzed comparing a 6-marker panel and a 5-marker panel with and without clinical parameters. The AUC value calculated from the biomarker panel and the clinical parameter panel was compared to the biomarker panel plus the clinical parameters, demonstrating an improvement with the addition of the clinical parameter variables into the multivariate logistic regression model analysis. Of the biomarkers tested, four contribute to the analysis for distinguishing benign from malignant nodules; they are CEA, CYFRA, NSE and ProGRP. Of the clinical parameters tested, six contribute to the multivariate analysis for distinguishing benign from malignant nodules; they are patient age, smoking status, smoking history (including pack years, smoking duration in years and smoking intensity), chest symptoms (such as thoracalgia, blood in sputum, chest tightness), cough and nodule size.

TABLE 14 6-biomarker Panel and Clinical Parameter Analysis

| Model | AUC | Sensitivity at 80% Specificity | Sensitivity at 90% Specificity |
|---|---|---|---|
| Individual Markers | | | |
| CA19-9 | 0.58 | | |
| CEA | 0.69 | | |
| CYFRA | 0.75 | | |
| NSE | 0.68 | | |
| SCC | 0.60 | | |
| ProGRP | 0.60 | | |
| Clinical Parameters Only | 0.75 | 53.9% | 30.5% |
| 6-marker panel¹ | 0.83 | 71.8% | 59.6% |
| 6-marker panel² | 0.84 | 70.5% | 64.7% |
| 6-marker panel + 7 clinical parameters³ | 0.87 | 74.3% | 66.9% |
| 4 Best Markers + 6 Best Clinical parameters⁴ | 0.87 | 75.8% | 70.2% |

¹Values normalized using MOM method
²Multivariate logistic regression analysis
³Age, smoking status, smoking history (pack years and packs per day), chest symptoms, cough, family history of lung cancer and nodule size.
⁴Step-wise MLR analysis; CEA, CYFRA, NSE and Pro-GRP; age, smoking status, pack years, chest symptoms, cough and nodule size

TABLE 15 5-Biomarker Panel and Clinical Parameters Analysis

| Model | AUC | Sensitivity at 80% Specificity | Sensitivity at 90% Specificity |
|---|---|---|---|
| Individual Markers | | | |
| CA19-9 | 0.58 | | |
| CEA | 0.69 | | |
| CYFRA | 0.75 | | |
| NSE | 0.68 | | |
| SCC | 0.60 | | |
| Clinical Parameters Only | 0.75 | 53.9% | 30.5% |
| 5-marker panel⁵ | 0.82 | 70.6% | 57.2% |
| 5-marker panel⁶ | 0.84 | 68.8% | 63.8% |
| 5-marker panel + 7 clinical parameters | 0.87 | 74.7% | 64.2% |
| 3 Best Markers + 6 Best Clinical Parameters | 0.87 | 75.6% | 68.4% |

⁵Values normalized using MOM method
⁶Multivariate logistic regression analysis

Example 9 A Multi-Marker Algorithm for Distinguishing Benign Vs Malignant Pulmonary Nodules

The cohort of 459 current and former (stopped within the last 15 years) smokers with pulmonary nodules from Example 8 was expanded to a total cohort of 1005 subjects. The objectives of this study were to screen a large amount of existing data in a cost effective and rapid approach for risk assessment algorithm development, and to demonstrate the importance of using algorithms to generate results from a panel of markers rather than the “any marker high” method. We also explored using advanced machine learning models to classify lung nodules as benign or malignant. Herein, we report the development of models and calculators for predicting the probability of lung cancer in pulmonary nodules using data from an LDCT screening cohort (n=1005).

Data from a cohort of 1005 subjects with radiographically apparent pulmonary nodules were obtained and analyzed as disclosed below and in Example 8, wherein 502 participants had malignant nodules (“cancer”) and 503 participants were a “control” group with benign nodules. The collected data was blinded prior to analysis. All subjects chosen for inclusion in the study were: a) age 50-80 at the time of initial evaluation; b) 20+ pack-year smokers; and c) current smokers or smokers that quit within the last 15 years. The cohort included both symptomatic and asymptomatic subjects. All subjects were tested for the following cancer biomarkers: CEA, CYFRA 21-1, NSE, CA 19-9, Pro-GRP and SCC. The diagnosis of each cancer patient (those with radiographically apparent pulmonary nodules) was confirmed by clinical outcome, imaging diagnosis and histological examination. The following clinical characteristics of each participant were also collected: age at time of blood draw, gender, smoking history (current or former), pack-years, family history of lung cancer, presence of symptoms, concomitant illnesses, and number and size of nodules.

TABLE 16 Clinical characteristics of the cancer and control subjects

| Characteristic | Cancer (502) | Control (503) |
|---|---|---|
| Age | 62 | 58 |
| Sex (% Male) | 91 | 91 |
| Symptomatic/Asymptomatic (%) | 57/43 | 58/42 |
| Median Pack years | 40 | 35 |
| Current/Former smokers (%) | 89/11 | 87/13 |
| Adenocarcinoma (%) | 41 | |
| Squamous (%) | 34 | |
| Small Cell (%) | 18 | |
| Large Cell (%) | 3 | |
| Stage I (%) | 54 | |
| Stage II (%) | 24 | |
| Stage III (%) | 18 | |
| Stage IV (%) | 4 | |

The protein biomarker concentrations were determined by a microparticle enzyme immunoassay using Abbott reagent sets (Abbott, USA) and measured by a chemical luminescence analyzer (ARCHITECT i2000SR, Abbott, USA) according to the manufacturer's recommendations.

Statistical Analysis

Logistic regression was used to predict the binary (yes/no) cancer patient outcome using a vector of independent variables that were continuous (e.g. biomarker concentration values) or dichotomous (e.g. current or former smoker). In the logistic model the binary (yes/no) outcome is converted to a probability function [ƒ(p)] using the following equation:

${f(p)} = ( \frac{p}{1 - p} )$

Therefore, the probability function can then be used in a predictive model including an intercept (α), and an estimate (β) for a predictor (X).

ƒ(p)=α+βX

When more than one predictor is used, the model is called a multivariate logistic regression:

$f(p) = \alpha + \beta_{1} X_{i1} + \beta_{2} X_{i2} + \dots + \beta_{p} X_{ip}$
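As a worked step, and assuming ƒ(p) denotes the standard logit (the natural logarithm of the odds), the linear model can be inverted to recover the probability p from the fitted intercept and estimates. The index j over predictors is notation introduced here for illustration only.

```latex
\begin{aligned}
f(p) &= \ln\!\left(\frac{p}{1-p}\right) = \alpha + \sum_{j} \beta_{j} X_{ij} \\
\frac{p}{1-p} &= e^{f(p)} \\
p &= \frac{e^{f(p)}}{1 + e^{f(p)}} = \frac{1}{1 + e^{-f(p)}}
\end{aligned}
```

Under this reading, ƒ(p)=0 corresponds to p=0.5, and larger positive values of ƒ(p) correspond to probabilities approaching 1.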

Stepwise logistic regression is a special type of multivariate logistic regression where predictors are iteratively included in the model if the predictive strength of the chi-square statistic for the predictor meets a pre-determined significance threshold (alpha=0.3).

The entire data set (N=1005) was treated as a training data set for model development. The panel of 6 biomarkers (CEA, CYFRA 21-1, NSE, CA 19-9, Pro-GRP and SCC) and 7 clinical factors (smoking status, pack years, age, history of lung cancer, symptoms (e.g., symptoms and signs associated with lung cancer: coughing, coughing up blood, shortness of breath, wheezing or noisy breathing, loss of appetite, fatigue, recurring infections, etc.), nodule size and cough) were analyzed. In the analysis, clinical parameters with no numerical value (e.g. coughing) are assigned a binary value, 1 or 0, according to whether the symptom is present or absent, whereas parameters with a numerical value, e.g. age or pack years, are used directly in the analysis. The MLR models developed were compared to the “any marker high” approach, wherein the test is considered positive if any individual biomarker value is above its respective cut-off point. For new model development, we added clinical parameters to the biomarker panel.

In embodiments, the MLR is used to calculate a probability value (also referred to herein as a composite score or predicted probability) for the measured values of the panel of biomarkers and clinical parameters. That probability value is then compared to a threshold value, wherein the radiographically apparent pulmonary nodules in a patient are classified as malignant if the probability value is above the threshold value, or as benign if the probability value is below the threshold value. In embodiments, that threshold value is simply a predictive value of 50%, wherein a patient with a predictive value above 50% is either classified as having malignant pulmonary nodules or is considered to have an increased likelihood of malignant pulmonary nodules. In other embodiments, the threshold is determined based on an 80% sensitivity, wherein a ROC/AUC analysis is performed based on the predictive value to determine if it is above or below a set threshold value.
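The two decision rules described above can be contrasted in a minimal Python sketch. The function names, the example cut-off values and the handling of a probability exactly equal to the threshold are assumptions made for illustration; the source does not specify them.

```python
from typing import Mapping


def any_marker_high(values: Mapping[str, float],
                    cutoffs: Mapping[str, float]) -> bool:
    """'Any marker high' rule: positive if any measured biomarker
    exceeds its own cut-off point."""
    return any(values[name] > cutoffs[name]
               for name in cutoffs if name in values)


def classify_by_probability(probability: float,
                            threshold: float = 0.5) -> str:
    """Composite-score rule: compare the model probability to a threshold.

    Assumption: a probability exactly equal to the threshold is treated
    as benign, since the source only describes 'above' and 'below'.
    """
    return "malignant" if probability > threshold else "benign"


# Hypothetical cut-offs, for illustration only (not values from the study).
EXAMPLE_CUTOFFS = {"CEA": 5.0, "CYFRA": 3.3, "NSE": 16.3}

if __name__ == "__main__":
    patient = {"CEA": 4.0, "CYFRA": 4.1, "NSE": 12.0}
    print(any_marker_high(patient, EXAMPLE_CUTOFFS))
    print(classify_by_probability(0.62))
```

The first rule reduces each marker to a yes/no flag before combining them, whereas the second thresholds a single score that already weights all inputs, which is the distinction the comparison in this example relies on.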

A series of alternative statistical methods to predict lung cancer (malignant pulmonary nodules) were tested in three runs, each using 80% of the sample as the training data set and 20% as a testing set. The following methods were run side by side on the model with the following clinical parameter and biomarker panel: Smoking Status, Patient Age, Nodule Size, CEA, CYFRA and NSE. In this study, that panel was the most predictive (highest AUC) for correctly distinguishing benign from malignant pulmonary nodules.

1. Logit model: simple traditional logistic regression model;
2. Random forest: this is done using Breiman's random forest algorithm for classification and regression, which can avoid overfitting the training dataset. A total of 500 decision trees were used to build the random forest;
3. Neural network: uses the traditional backpropagation algorithm in the model, and 2 hidden layers;
4. Support vector machine (SVM): uses the default setting of R package “e1071”;
5. Decision tree: uses recursive partitioning and regression trees in R package “rpart”;
6. Deep learning: uses the default setting of R package “h2o”, which has two hidden layers of 200 nodes each in the neural network.

All statistical analyses were performed using SAS® v9.3 or higher.

Results

Logistic regression (univariate, multivariate and stepwise multivariate) was used to develop an algorithm for lung cancer risk prediction. Results of the logistic regression analyses performed to predict malignant pulmonary nodules are reported in Table 17:

TABLE 17 Univariate and multivariate logistic regressions predicting lung cancer (N = 1005)

| Logistic Regression Method | Model | AUC (Area Under the Curve) | Lower 95 CI | Upper 95 CI | Sensitivity at 80% Specificity |
|---|---|---|---|---|---|
| Univariate | Smoking Status | 0.51 | 0.49 | 0.53 | 20.5 |
| Univariate | Pack-years | 0.59 | 0.56 | 0.63 | 26.3 |
| Univariate | Age | 0.66 | 0.63 | 0.70 | 39.1 |
| Univariate | History of LC | 0.50 | 0.49 | 0.52 | 20.1 |
| Univariate | Symptoms | 0.52 | 0.49 | 0.56 | 21.9 |
| Univariate | Nodule Size | 0.71 | 0.68 | 0.74 | 47.3 |
| Univariate | CA 19-9 | 0.58 | 0.54 | 0.62 | 31.6 |
| Univariate | CEA | 0.71 | 0.68 | 0.74 | 50.2 |
| Univariate | CYFRA | 0.77 | 0.74 | 0.79 | 59.3 |
| Univariate | NSE | 0.70 | 0.67 | 0.73 | 49.1 |
| Univariate | SCC | 0.60 | 0.57 | 0.63 | 37.2 |
| Univariate | Cough | 0.56 | 0.53 | 0.59 | 27.2 |
| Univariate | Any marker high | 0.74 | 0.70 | 0.77 | 46.0 |
| Multivariate | All 6 Biomarkers | 0.84 | 0.81 | 0.87 | 70.4 |
| Multivariate | All Predictors (6 Biomarkers and 7 Clinical Factors) | 0.87 | 0.85 | 0.90 | 75.2 |
| Multivariate | 3 Biomarkers and 3 Clinical Factors | 0.88 | 0.85 | 0.89 | 76.0 |

As shown in Table 17, combining the biomarkers, either in the univariate “any marker high” model or in the multivariate model using all 6 biomarkers (CEA, CYFRA 21-1, NSE, CA 19-9, Pro-GRP and SCC), was generally more accurate than considering the individual biomarkers alone (AUC 0.58-0.77 for the individual biomarkers vs. 0.74 and 0.84, respectively). However, the univariate “any marker high” model, with an AUC of 0.74, was clearly not as good a predictive model as the multivariate model with all 6 biomarkers (AUC 0.84).

For new model development, we added clinical parameters to the biomarker panel, combining all 6 biomarkers (CEA, CYFRA, NSE, Pro-GRP, SCC, CA 19-9) and 7 clinical variables (Family History of lung cancer, Nodule size, Recoded Symptoms (e.g., those associated with early or late stage lung cancer, such as coughing, coughing up blood, shortness of breath, wheezing or noisy breathing, loss of appetite, fatigue, recurring infections, etc.), Pack-years, Patient Age, Smoking Status, Cough). This model yielded an AUC of 0.87. When specificity was fixed at 80%, the sensitivity for 1) the “any marker high” model, 2) the model with 6 biomarkers only, and 3) the combined 6 biomarkers and 7 clinical factors model was 46.0%, 70.4% and 75.2%, respectively.
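The two summary statistics reported here, AUC and sensitivity at a fixed specificity, can be computed from model scores as in the following Python sketch. It is an illustration under stated assumptions, not the SAS procedure used in the study: the function names are hypothetical, a subject is called positive when the score is strictly above the threshold, and thresholds are searched only among observed control scores.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen control (ties count one half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores: Sequence[float],
                               control_scores: Sequence[float],
                               specificity: float = 0.80) -> float:
    """Sensitivity at the lowest threshold whose specificity is at least
    the requested value. Positive means score > threshold."""
    n_controls = len(control_scores)
    for threshold in sorted(set(control_scores)):
        true_negatives = sum(1 for s in control_scores if s <= threshold)
        if true_negatives / n_controls >= specificity:
            true_positives = sum(1 for s in case_scores if s > threshold)
            return true_positives / len(case_scores)
    return 0.0  # only reached when there are no control scores


if __name__ == "__main__":
    # Toy scores for illustration only (not study data).
    controls = [0.1, 0.2, 0.3, 0.4, 0.9]
    cases = [0.35, 0.5, 0.6, 0.8, 0.95]
    print(auc(cases, controls))
    print(sensitivity_at_specificity(cases, controls, 0.80))
```

Fixing specificity first and then reading off sensitivity is what allows models with different score scales, such as the “any marker high” rule and the multivariate models, to be compared on a common footing.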

On the basis of both the univariate and multivariate results, the panel of six predictors (3 biomarkers and 3 clinical factors) was chosen: CEA, CYFRA, NSE, Smoking Status, Patient Age at exam, and Nodule Size. This panel of 6 predictors resulted in the best discrimination accuracy, with 0.88 AUC and 76% sensitivity at 80% specificity (FIG. 17, Table 17).

The algorithm used for computing risk (i.e. probability of lung cancer) with this model was:

$f(p) = \alpha + \beta_{SmokingStatus} X_{SmokingStatus} + \beta_{PatientAgeAtExam} X_{PatientAgeAtExam} + \beta_{NoduleSize} X_{NoduleSize} + \beta_{TestValue\_CEA} X_{TestValue\_CEA} + \beta_{TestValue\_CYFRA} X_{TestValue\_CYFRA} + \beta_{TestValue\_NSE} X_{TestValue\_NSE}$

Using the combined biomarker-clinical model, we evaluated the test accuracy by cancer stage and histology. Table 18 shows that the test sensitivity increased with cancer stage. The most prevalent NSCLC types, adenocarcinoma and squamous cell carcinoma (SCC), demonstrated similar performance in this study (sensitivities 72% and 77%; AUC 0.85 and 0.87, respectively, p<0.0001) (Table 18). Small cell lung cancer (SCLC), a fast-growing type of cancer which presents challenges in early detection and diagnosis, was detected with 0.95 AUC and 82% sensitivity at 80% specificity.

TABLE 18 Multivariate logistic results including the variables Smoking Status, Patient Age, Nodule Size, CEA, CYFRA and NSE, categorized by stage and histological subtype

| Sample | AUC* | Lower 95 CI^(#) | Upper 95 CI | Sensitivity at 80% Specificity | Sample size |
|---|---|---|---|---|---|
| All cases and controls | 0.87 | 0.84 | 0.89 | 76.2 | cases = 502, controls = 503 |
| Stage I | 0.76 | 0.72 | 0.80 | 55.6 | cases = 180, controls = 503 |
| Stage II | 0.93 | 0.89 | 0.97 | 76.5 | cases = 51, controls = 503 |
| Stage III | 0.93 | 0.91 | 0.95 | 87.3 | cases = 158, controls = 503 |
| Stage IV | 0.97 | 0.95 | 0.99 | 92.0 | cases = 112, controls = 503 |
| Small Cell Lung Cancer | 0.95 | 0.93 | 0.98 | 82.4 | cases = 91, controls = 503 |
| Squamous Cell Carcinoma | 0.87 | 0.84 | 0.91 | 77.2 | cases = 171, controls = 503 |
| Adenocarcinoma | 0.85 | 0.82 | 0.88 | 72.1 | cases = 208, controls = 503 |

Based on the 3 biomarkers plus 3 clinical factors model, the relative risk of a patient having lung cancer (a comparison of the proportion of ‘positive’ outcomes in the cases vs. the controls) was calculated. A patient's measured biomarker concentrations and numerical clinical predictors (e.g. 1 or 0 for yes or no clinical parameters, or a relevant number such as age, pack years or size of nodules) were multiplied by the maximum likelihood estimates from the logistic regression model. These values are then summed, together with the model intercept, to give ƒ(p), which is converted to a probability and multiplied by 100 to express the patient's percent risk of cancer. This could be a diagnostic tool to let doctors know the probability that their patient has lung cancer based on the model we are using. In addition, those patients with an increased risk for lung cancer can then either be screened using CT or provided with a therapeutic treatment.
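The risk calculation can be sketched in Python as follows. The intercept and coefficient values are hypothetical placeholders, since the fitted maximum likelihood estimates are not reported here; the predictor names follow the risk equation for the 3 biomarker plus 3 clinical factor model, and the conversion from ƒ(p) to a probability assumes the standard logistic (inverse logit) function.

```python
import math
from typing import Mapping

# Hypothetical values for illustration only; NOT the fitted estimates.
HYPOTHETICAL_INTERCEPT = -6.0
HYPOTHETICAL_COEFFICIENTS = {
    "SmokingStatus": 0.20,      # 1 = current smoker, 0 = former (assumed coding)
    "PatientAgeAtExam": 0.05,   # years
    "NoduleSize": 0.04,         # mm
    "TestValue_CEA": 0.15,
    "TestValue_CYFRA": 0.30,
    "TestValue_NSE": 0.05,
}


def percent_risk(patient: Mapping[str, float],
                 intercept: float = HYPOTHETICAL_INTERCEPT,
                 coefficients: Mapping[str, float] = HYPOTHETICAL_COEFFICIENTS
                 ) -> float:
    """Return percent risk: 100 * inverse-logit(intercept + sum(beta * x))."""
    z = intercept + sum(beta * patient[name]
                        for name, beta in coefficients.items())
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        probability = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        probability = e / (1.0 + e)
    return 100.0 * probability


if __name__ == "__main__":
    example_patient = {
        "SmokingStatus": 1, "PatientAgeAtExam": 62, "NoduleSize": 22,
        "TestValue_CEA": 4.0, "TestValue_CYFRA": 4.0, "TestValue_NSE": 13.0,
    }
    print(f"{percent_risk(example_patient):.1f}% risk (hypothetical model)")
```

Keeping the coefficients in a mapping keyed by predictor name makes it straightforward to substitute a different fitted panel without changing the calculation itself.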

Advanced Cognitive Computing Approaches Models

We also evaluated a deep learning neural network (DNN) method, as well as other modelling approaches (random forest, classification and regression trees, support vector machine), using the entire data set (n=1005) (Table 19). These methods have been used to develop algorithms that combine measurements of the most predictive biomarkers and clinical parameters in a panel to achieve the highest diagnostic accuracy. The results summarized in Table 19 demonstrated that the DNN method provides better prediction accuracy in discriminating lung cancer from benign pulmonary nodules than the other methods.

TABLE 19 Comparison of results using 3 biomarkers and 3 clinical variables (Smoking Status, Patient Age, Nodule Size, CEA, CYFRA and NSE) from different modelling approaches (Random Forest, SVM, Decision tree and Deep Learning Neural Network) to predict lung cancer.

| Method | AUC* | 95% CI^(#) | Sensitivity at 80% Specificity |
|---|---|---|---|
| Random Forest | 0.862 | 0.821-0.902 | 75 |
| SVM | 0.848 | 0.805-0.891 | 69 |
| Decision tree | 0.806 | 0.759-0.852 | 71 |
| Deep learning (DNN) | 0.890 | 0.832-0.910 | 79 |

Model Cross Validation:

Cross validation is an important model validation technique for assessing how the results could be generalized to an independent data set. We applied repeated random sub-sampling validation, in which we randomly split the dataset into training and validation sets at different ratios. The results were averaged over the splits and are provided in Table 19.
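Repeated random sub-sampling validation can be sketched in Python as follows. The function name, the `evaluate` callback (which would fit a model on the training part and return a metric such as AUC on the validation part), the split ratios and the number of repeats are all assumptions introduced for illustration; the source does not state them for this analysis.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Record = TypeVar("Record")


def repeated_random_subsampling(
    records: Sequence[Record],
    evaluate: Callable[[List[Record], List[Record]], float],
    train_fractions: Sequence[float] = (0.8,),
    repeats: int = 3,
    seed: int = 0,
) -> float:
    """Average a validation metric over repeated random train/validation splits.

    `evaluate(train, validation)` is a hypothetical user-supplied callback.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    scores: List[float] = []
    for fraction in train_fractions:
        for _ in range(repeats):
            shuffled = list(records)
            rng.shuffle(shuffled)
            cut = int(round(fraction * len(shuffled)))
            train, validation = shuffled[:cut], shuffled[cut:]
            scores.append(evaluate(train, validation))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Dummy metric: fraction of records held out for validation.
    data = list(range(10))
    print(repeated_random_subsampling(
        data, lambda tr, va: len(va) / len(data),
        train_fractions=(0.8, 0.5), repeats=2))
```

Unlike k-fold cross validation, the validation sets drawn this way may overlap between repeats, which is the trade-off for being able to choose the split ratio freely.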

Relationship with Nodule Size

Further analysis of the data set from the cohort of n=1005 focused on the relationship between nodule size and the probability that a nodule is malignant.

The histogram (see FIG. 27) shows the distribution of nodule sizes for “cancer” and “control” participants in the cohort of n=1005. 535 patients in this set had nodules 30 mm or larger in diameter. In general, lung nodules were larger in patients with lung cancer (malignant nodules) than in patients with benign nodules. The entire data set was categorized into 3 nodule size categories: 0-14, 15-29, and ≥30 mm. The univariate, and then the multivariate and stepwise multivariate, logistic regression analyses were performed on the 3 subsamples of the n=1005 cohort data set. Based on the results, the best model combining biomarker values and clinical factors was chosen for each nodule size category. See Table 20. The MLR model for the first nodule category (0-14 mm) includes 4 biomarkers (CEA, CYFRA, NSE, Pro-GRP) and 4 clinical parameters (patient age at the time of exam, cough, smoking duration, presence of symptoms). Pro-GRP did not improve the test accuracy for nodule groups 2 and 3 and was omitted from those models.

TABLE 20 Model performance by nodule size category

| Variables in the model | Nodule size | Samples | AUC* | Lower 95% CI^(#) | Upper 95% CI^(#) | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| 4 Biomarkers (CEA, CYFRA, NSE, Pro-GRP) + 4 clinical parameters | 0-14 mm | cases = 23, controls = 54 | 0.84 | 0.73 | 0.95 | 60.9 | 88.9 |
| 3 Biomarkers (CEA, CYFRA, NSE) + 4 clinical parameters | 15-29 mm | cases = 148, controls = 193 | 0.79 | 0.75 | 0.84 | 62.8 | 77.2 |
| 3 Biomarkers (CEA, CYFRA, NSE) + 4 clinical parameters | ≥30 mm | cases = 331, controls = 204 | 0.91 | 0.89 | 0.94 | 83.7 | 81.9 |

FIG. 19 shows ROC graphs for the three nodule subgroups. As shown in Table 20 and FIG. 19, the AUC of the combined biomarker-clinical factors assessment was 0.84 in patients with small nodules (0-14 mm), 0.79 in those with intermediate size nodules (15-29 mm) and 0.91 in those with large nodules (≥30 mm).

The best model for distinguishing malignant from benign intermediate size nodules (15-29 mm) is a combination of 3 biomarkers (CEA, CYFRA, NSE) + 4 clinical parameters (including Patient Age, Cough, and Smoking Duration), with 62.8% sensitivity and 77.2% specificity. See Table 20. The same combination of biomarkers and clinical parameters was used for the large size nodules (≥30 mm) and distinguished benign from malignant nodules with higher sensitivity and specificity, at 83.7% and 81.9%, respectively. See Table 20. For the smallest nodules (0-14 mm) the best model was 4 biomarkers (CEA, CYFRA, NSE, and Pro-GRP) and 4 clinical parameters (Symptoms, Patient Age, Cough and Smoking Duration).
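Selecting the size-specific biomarker panel can be sketched in Python as follows. The function and constant names are hypothetical, and the handling of non-integer diameters (for example 14.5 mm falling in the 0-14 mm category) is an assumption, since the categories are stated only in whole millimetres. Only the biomarker panels are encoded; the clinical parameters and fitted estimates for each category would be supplied separately.

```python
from typing import Dict, Tuple


def nodule_size_category(diameter_mm: float) -> str:
    """Map a nodule diameter in mm to one of the three size categories."""
    if diameter_mm < 0:
        raise ValueError("nodule diameter must be non-negative")
    if diameter_mm < 15:
        return "0-14 mm"
    if diameter_mm < 30:
        return "15-29 mm"
    return ">=30 mm"


# Biomarker panel per size category, as described for Table 20.
BIOMARKER_PANELS: Dict[str, Tuple[str, ...]] = {
    "0-14 mm": ("CEA", "CYFRA", "NSE", "Pro-GRP"),
    "15-29 mm": ("CEA", "CYFRA", "NSE"),
    ">=30 mm": ("CEA", "CYFRA", "NSE"),
}


def panel_for_nodule(diameter_mm: float) -> Tuple[str, ...]:
    """Return the biomarker panel used for a nodule of the given size."""
    return BIOMARKER_PANELS[nodule_size_category(diameter_mm)]


if __name__ == "__main__":
    for size in (8, 22, 35):
        print(size, nodule_size_category(size), panel_for_nodule(size))
```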

To calculate the % probability of lung cancer in each nodule size category, the maximum likelihood estimates from the MLR model were used. The scatter dot plot in FIG. 20 shows the lung cancer probability for each nodule size category.

Discussion

The high sensitivity of LDCT comes at the cost of detecting many false positives, including benign pulmonary nodules. Studies have indicated that radiologists have a difficult time effectively differentiating true (malignant) nodules from false positives. Moreover, the management of small lung nodules discovered on screening CT scans has become a very difficult problem. When nodules between 8 mm and 15-20 mm in size are found (Lung-RADS ver. 1.0 assessment categories 4A, 4B, and 4X), physicians face a wide array of choices and must balance a complicated clinical picture. Patients categorized as Lung-RADS Category-4 (evident in about 6% of all LDCTs in the USA) present a quandary to physicians of whether to include additional LDCT, full-exposure CT with or without contrast, PET-CT, needle biopsy or resection. A blood biomarker test that can identify patients with a higher risk and, alternatively, a lower risk of lung cancer (with a significant gray-zone) would beneficially improve the care and cost of handling patients with lung cancer.

We now have compelling evidence that by using an algorithmic approach we can generate a risk score (increased risk of lung cancer) that is more accurate than a risk assessment obtained from any individual marker or by a “multiple cutoff” approach. In this study, we analyzed a large data set (n=1005) from a retrospective cohort of high risk patients from China and demonstrated in this training set that the accuracy of the biomarker test was significantly improved using an algorithm that integrates biomarker values and clinical factors. The overall sensitivity of the combined MLR-based biomarker-clinical model was 76% at a specificity of 80% and 0.88 AUC. This performance was significantly superior to that of the univariate “any marker high” model, with an AUC of 0.74 and 46% sensitivity at 80% specificity. Sensitivity for early stage disease (I and II) in this study was approximately 66% at 80% specificity (based on the 3 biomarkers plus 3 clinical factors MLR model) compared to ˜90% sensitivity for late stage (III and IV). The use of the deep learning neural network method further improved the test performance, resulting in a sensitivity of 77% at 80% specificity. These preliminary results showed that the deep neural network provided better prediction accuracy than the other methods.

We also established an algorithm in an intent-to-test population of patients with indeterminate single pulmonary nodules. Lung nodules that are more than 30 mm in size are presumed to be malignant and are removed by surgery. Nodules between 5 and 30 mm may be benign or malignant, with the likelihood of malignancy increasing with size. Therefore, a blood test that can reduce the number of false positives and unnecessary biopsies would be desirable. The n=1005 cohort set included 371 patients with nodules between 15 and 29 mm. In the US, patients categorized into that group based on nodule size are followed aggressively because of the higher rate of lung cancer in patients with nodules of this size (e.g., 15 to 29 mm) and because, at less than 30 mm, they are not frequently sent to surgery to have the nodule removed. The present blood biomarker algorithm can identify lung cancer patients in this cohort (15-29 mm) with 63% sensitivity and 77% specificity. Almost 100 patients in the n=1005 cohort had nodules less than 15 mm in size. In the US, patients categorized into that group based on nodule size are conservatively managed. The present combined biomarker-clinical factors algorithm can identify a sub-population of patients in this group (0-14 mm nodules) that have a high risk of cancer with 61% sensitivity and 89% specificity. The use of such an algorithm could potentially dictate further diagnostic and/or invasive procedures, such as a CT scan, needle biopsy or tissue resection.

In summary, this case-control study demonstrated that immunoassay marker performance can be significantly improved with the addition of clinical factors and advanced data processing (algorithms). We developed a discontinuous, multivariate model with biomarkers and clinical variables that discriminate between malignant and benign nodules.

1-79. (canceled)
80. A computer implemented method for predicting a likelihood of having cancer in a patient, in a computer system having one or more processors coupled to a memory storing one or more computer readable instructions for execution by the one or more processors, the one or more computer readable instructions comprising instructions for: storing a set of data comprising a plurality of patient records, each patient record including a plurality of parameters and corresponding values for a patient, and wherein the set of data also includes a diagnostic indicator indicating whether or not the patient has been diagnosed with cancer; selecting a subset of the plurality of parameters for inputs into a machine learning system, wherein the subset includes a panel of at least two different biomarkers and at least one clinical parameter; randomly partitioning the set of data into training data and validation data; generating a classifier using a machine learning system based on the training data and the subset of inputs, wherein each input has an associated weight; and determining whether the classifier meets a predetermined Receiver Operator Characteristic (ROC) statistic, specifying a sensitivity and a specificity, for correct classification of patients.
81. The computer implemented method of claim 80, further comprising iteratively regenerating the classifier when the classifier does not meet the predetermined ROC statistic, by using a different subset of inputs and/or by adjusting the associated weights of the inputs until the regenerated classifier meets the predetermined ROC statistic.
82. The computer implemented method of claim 80, further comprising generating a static configuration of the classifier when the machine learning system meets the predetermined ROC statistic.
83. (canceled)
84. The computer implemented method of claim 82, further comprising: configuring a computing device accessible by a user with the static classifier; entering values for the subset of the plurality of parameters corresponding to a patient into the computing device; and classifying, using the static classifier, the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer.
85. The computer implemented method of claim 84, wherein the category indicative of a likelihood of having cancer is further categorized into qualitative groups.
 86. (canceled)
 87. (canceled)
88. The computer implemented method of claim 85, wherein the quantitative groups are provided to the user as a percentage, multiplier value, composite score or risk score for the likelihood of having cancer.
89. The computer implemented method of claim 84, further comprising providing a notification to the user recommending diagnostic testing when the patient is classified into the category indicative of a likelihood of having cancer.
90. The computer implemented method of claim 89, wherein the diagnostic testing is radiographic screening.
91. The computer implemented method of claim 89, further comprising: (1) obtaining test results from the diagnostic testing which confirm or deny the presence of cancer; (2) incorporating the test results into the training data for further training of the machine learning system; and (3) generating an improved classifier by the machine learning system.
92. The computer implemented method of claim 80, wherein the panel of biomarkers is selected from the group consisting of: AFP, CA125, CA 15-3, CA 19-9, CEA, CYFRA 21-1, HE-4, NSE, Pro-GRP, PSA, SCC, anti-Cyclin E2, anti-MAPKAPK3, anti-NY-ESO-1, and anti-p53.
93. The computer implemented method of claim 92, wherein the panel of biomarkers includes any two, any three, any four, any five, or any six biomarkers.
 94. (canceled)
95. The computer implemented method of claim 80, wherein the clinical parameters are selected from the group consisting of: (1) age; (2) gender; (3) smoking status; (4) number of pack years; (5) symptoms; (6) family history of cancer; (7) concomitant illnesses; (8) number of nodules; (9) size of nodules; and (10) imaging data.
 96. (canceled)
97. The computer implemented method of claim 80, further comprising an input to the machine learning system corresponding to a biomarker velocity, wherein the biomarker velocity is determined by: (1) obtaining serial values for a biomarker from the patient; and (2) determining a biomarker velocity for the biomarker based upon the serial values.
98. The computer implemented method of claim 80, wherein the plurality of parameters further comprise one or more parameters from the group consisting of: (a) patient electronic medical records (EMR); (b) medical literature; (c) images; and (d) geography.
99. The computer implemented method of claim 80, wherein the classifier is a neural net, a support vector machine, a decision tree, a random forest, a neural network, or a deep learning neural network.
100. The computer implemented method of claim 99, wherein the neural net has any one or more of the following features: (1) at least two hidden layers; (2) at least two outputs, with a first output indicating that lung cancer is likely and a second output indicating that lung cancer is not likely; and (3) 20-30 nodes.
 101. (canceled)
102. The computer implemented method of claim 80, wherein the classifier has a specificity of at least 80%.
103. The computer implemented method of claim 102, wherein the classifier has a sensitivity of at least 70%.
104. (canceled)
105. The computer implemented method of claim 80, wherein the cancer is selected from the group consisting of: breast cancer, bile duct cancer, bone cancer, cervical cancer, colon cancer, colorectal cancer, gallbladder cancer, kidney cancer, liver or hepatocellular cancer, lobular carcinoma, lung cancer, melanoma, ovarian cancer, pancreatic cancer, prostate cancer, skin cancer, and testicular cancer.
106. (canceled)
107. A computer implemented method of assessing the likelihood that a patient has lung cancer relative to a population comprising: measuring the values of a panel of biomarkers in a sample from a patient; obtaining clinical parameters from the patient; utilizing a classifier generated by a machine learning system to classify the patient into a category indicative of a likelihood of having cancer or into another category indicative of a likelihood of not having cancer, wherein the classifier comprises a sensitivity of at least 70% and a specificity of at least 80%, and wherein the classifier is generated using a panel of biomarkers comprising at least two different biomarkers, and at least one clinical parameter; and when a patient is classified into a category indicating a likelihood of having cancer, providing a notification to a user for diagnostic testing.
108-146. (canceled)